# Accuracy Assessment of WOfS Product in Africa using Ground Truth Data  <img align="right" src="../Supplementary_data/DE_Africa_Logo_Stacked_RGB_small.jpg">

* **Products used:** 
[ga_ls8c_wofs_2](https://explorer.digitalearth.africa/ga_ls8c_wofs_2),
[ga_ls8c_wofs_2_summary](https://explorer.digitalearth.africa/ga_ls8c_wofs_2_summary),
[usgs_ls8c_level2_2]()

Notes:
* As of 26 June 2020, Landsat 8 Collection 2 data at continental scale is confidential.
* Run this notebook in the Collection 2 Read Private Workspace if you need to use the Landsat 8 Collection 2 sample dataset.

## Background
[Water Observations from Space (WOfS)](https://www.ga.gov.au/scientific-topics/community-safety/flood/wofs/about-wofs) is a product derived from Landsat 8 satellite observations, here the provisional Landsat 8 Collection 2 surface reflectance, and shows surface water detected in Africa.
Individual water classified images are called Water Observation Feature Layers (WOFLs), and are created in a 1-to-1 relationship with the input satellite data. 
Hence there is one WOFL for each satellite dataset processed for the occurrence of water.

The data in a WOFL is stored as a bit field. This is a binary number, where each digit of the number is independently set or not based on the presence (1) or absence (0) of a particular attribute (water, cloud, cloud shadow etc.). In this way, the single decimal value associated with each pixel can provide information on a variety of features of that pixel. 
For more information on the structure of WOFLs and how to interact with them, see [Water Observations from Space](../Datasets/Water_Observations_from_Space.ipynb) and [Applying WOfS bitmasking](../Frequently_used_code/Applying_WOfS_bitmasking.ipynb) notebooks. 
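
To make the bit-field idea concrete, here is a minimal sketch in plain Python. The bit positions in `FLAG_BITS` are assumptions chosen for illustration and should be checked against the flag definition of the product you load; `decode_wofl_value` is a hypothetical helper, not part of any library.

```python
# Assumed bit positions, for illustration only; confirm them against the
# flag definition of the product you load.
FLAG_BITS = {
    "nodata": 0,
    "cloud_shadow": 5,
    "cloud": 6,
    "water_observed": 7,
}


def decode_wofl_value(value):
    """Return a dict of flag name -> bool for a single WOFL pixel value."""
    return {name: bool((value >> bit) & 1) for name, bit in FLAG_BITS.items()}


# 128 = 0b10000000: only the (assumed) water bit is set
print(decode_wofl_value(128))
# 192 = 0b11000000: the (assumed) water and cloud bits are both set,
# so this would not count as a clear observation
print(decode_wofl_value(192))
```

In practice the `make_mask` function used later in this notebook does this decoding by flag name, so the bit positions never need to be hard-coded.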

Accuracy assessment for the WOfS product in Africa involves generating a confusion (error) matrix for the binary WOFL classification.
The inputs for estimating the accuracy of the WOfS derived product are a binary WOFL classification layer showing water/non-water and a shapefile containing validation points collected with the [Collect Earth Online](https://collect.earth/) tool. The validation points are the ground truth (actual) data, while the value extracted from the WOFL at each location is the predicted value. The output of this analysis is a confusion matrix containing the overall, producer's and user's accuracy. 
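
For reference, the three accuracy measures can be written in terms of the confusion matrix. The notation is an assumption introduced here for illustration: $n_{ij}$ is the number of validation points whose actual class is $i$ and whose predicted class is $j$, $n_{i+}$ and $n_{+j}$ are the row and column totals, and $N$ is the total number of points.

$$
\text{Overall accuracy} = \frac{\sum_i n_{ii}}{N}, \qquad
\text{Producer's accuracy}_i = \frac{n_{ii}}{n_{i+}}, \qquad
\text{User's accuracy}_j = \frac{n_{jj}}{n_{+j}}
$$

With actual classes on the rows, producer's accuracy is the share of reference points of a class that were classified correctly, and user's accuracy is the share of points predicted as a class that really belong to it.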

## Description
This notebook explains how to perform an accuracy assessment of the WOfS derived product using a collected ground truth dataset. 

The notebook demonstrates how to:

1. Load the validation points for each partner institution as a list of observations, each with a location and a month
2. Query WOFL data for the validation points and capture the available WOfS observations
3. Extract statistics for each WOfS observation at each validation point, including min, max and mean values for each point (location and month)
4. Build a look-up table (LUT) for each point that contains both the validation information and the WOfS result for each month
5. Generate a confusion matrix for the WOFL classification
6. Assess the accuracy of the classification
***

* Two extreme cases: 
    - test only the WOfS classifier; excluding clouds is acceptable
        - keep clear observations and remove non-clear ones
        - then query those that are water/non-water
    - include terrain, so that water observed with no terrain flag is predicted true

## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

After finishing the analysis, you can modify some values in the "Analysis parameters" cell and re-run the analysis to load WOFLs for a different location or time period.

### Load packages
Import Python packages that are used for the analysis.

In [23]:
%matplotlib inline

import time 
import datacube
from datacube.utils import masking, geometry 
import sys
import os
import dask 
import rasterio, rasterio.features
import xarray
import glob
import numpy as np
import pandas as pd
import seaborn as sn
import geopandas as gpd
import subprocess as sp
import matplotlib.pyplot as plt
import scipy, scipy.ndimage
import warnings
warnings.filterwarnings("ignore") #this will suppress the warnings for multiple UTM zones in your AOI 

sys.path.append("../Scripts")
from rasterio.mask import mask
from geopandas import GeoSeries, GeoDataFrame
from shapely.geometry import Point
from deafrica_plotting import map_shapefile,display_map, rgb
from deafrica_spatialtools import xr_rasterize
from deafrica_datahandling import wofs_fuser, mostcommon_crs,load_ard,deepcopy
from deafrica_dask import create_local_dask_cluster

### Connect to the datacube
Activate the datacube database, which provides functionality for loading and displaying stored Earth observation data.

In [24]:
dc = datacube.Datacube(app='WOfS_accuracy')

### Set up a Dask cluster

Dask can be used to better manage memory use and conduct the analysis in parallel. 
For an introduction to using Dask with Digital Earth Africa, see the [Dask notebook](../Beginners_guide/08_Parallel_processing_with_dask.ipynb).

>**Note**: We recommend opening the Dask processing window to view the different computations that are being executed; to do this, see the *Dask dashboard in DE Africa* section of the [Dask notebook](../Beginners_guide/08_Parallel_processing_with_dask.ipynb).

To activate Dask, set up the local computing cluster using the cell below.

In [25]:
create_local_dask_cluster()

Client: Scheduler: tcp://127.0.0.1:37767, Dashboard: /user/neginm/proxy/46247/status
Cluster: Workers: 1, Cores: 2, Memory: 14.18 GB


### Analysis parameters

1. Load the validation points for each partner institution as a list of observations, each with a location and a month
    * Load the `csv` validation file as a pandas dataframe
    * Convert the pandas dataframe into a ground truth `shapefile`

### Loading Dataset

In [26]:
#convert the pandas geo-dataframe to a shapefile and save it 
#Read in the validation data csv
# CEO = '../Supplementary_data/Validation/Refined/CEO_RCMRD_2020-07-30.csv'
# df = pd.read_csv(CEO,delimiter=",")
# geometry = [Point(xy) for xy in zip(df.LON, df.LAT)]
# crs = {'init': 'epsg:4326'} 
# geo_df = GeoDataFrame(df, crs=crs, geometry=geometry)
# geo_df.to_file(driver='ESRI Shapefile', filename='../Supplementary_data/Validation/Refined/groundtruth_RCMRD.shp')

In [27]:
#Path to the validation data points shapefile 
# GT = gpd.read_file('../Supplementary_data/Validation/Refined/groundtruth_RCMRD.shp')
# #reproject the shapefile from EPSG:4326 to match WOfS dataset 
# ground_truth  = GT.to_crs({'init': 'epsg:6933'})

In [28]:
#path = '../Supplementary_data/Validation/subset_clip.shp'
path = '../Supplementary_data/Validation/Refined/groundtruth_RCMRD.shp'

In [29]:
#open shapefile and ensure it's in WGS84 coordinates
input_data = gpd.read_file(path).to_crs('epsg:4326')
#check the shapefile by plotting it
#map_shapefile(input_data, attribute='CLASS')

In [30]:
input_data.columns

Index(['Unnamed_ 0', 'Unnamed__1', 'PLOT_ID', 'LON', 'LAT', 'FLAGGED',
       'ANALYSES', 'SENTINEL2Y', 'WATER', 'NO_WATER', 'BAD_IMAGE', 'NOT_SURE',
       'CLASS', 'COMMENT', 'MONTH', 'WATERFLAG', 'geometry'],
      dtype='object')

To do the accuracy assessment of the validation in each AEZ, we need to obtain WOfS surface water observation data that corresponds with the labelled input data locations. 

The function `collect_training_data` takes our shapefile containing class labels and extracts training data from the datacube over the location specified by the input geometries. The function will also pre-process our training data by stacking the arrays into a useful format and removing any `NaN` (not-a-number) values.


> **The following cell can take several minutes to run.** The class labels will be contained in the first column of the output array.  If you set `ncpus > 1`, then this function will be run in parallel across the specified number of processes.

In [31]:
#clean the first two columns
input_data= input_data.drop(['Unnamed_ 0','Unnamed__1'], axis=1)

In [32]:
coords = [(x,y) for x, y in zip(input_data.geometry.x, input_data.geometry.y)]

In [33]:
pd.date_range('01-2018','01-2019', freq='M')

DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31'],
              dtype='datetime64[ns]', freq='M')

## Sample WOfS at the ground truth coordinates
To load WOFL data, we can first create a re-usable query as below that will define the time period we are interested in, as well as other important parameters that are used to correctly load the data. 

As WOFLs are created scene-by-scene, and some scenes overlap, it's important when loading data to `group_by` solar day, and ensure that the data between scenes is combined correctly by using the WOfS `fuse_func`.
This will merge observations taken on the same day, and ensure that important data isn't lost when overlapping datasets are combined.
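
The idea of fusing overlapping same-day scenes can be sketched with NumPy. This is a sketch of the general approach only, not the implementation of `wofs_fuser`: the helper `fuse_same_day` is hypothetical, and it assumes that the lowest bit marks "no data".

```python
import numpy as np

NODATA_BIT = 1  # assumption: the lowest bit marks "no data"


def fuse_same_day(dest, src):
    """Merge `src` into `dest` in place (hypothetical helper).

    - where `dest` has no data, take the value from `src`
    - where both scenes have data, keep any flag raised in either scene
    """
    dest_empty = (dest & NODATA_BIT).astype(bool)
    src_empty = (src & NODATA_BIT).astype(bool)
    both_valid = ~dest_empty & ~src_empty
    dest[dest_empty] = src[dest_empty]
    dest[both_valid] |= src[both_valid]
    return dest


# Two hypothetical overlapping scenes from the same solar day
scene_a = np.array([1, 128, 0, 64], dtype=np.uint8)
scene_b = np.array([128, 1, 64, 128], dtype=np.uint8)

fused = fuse_same_day(scene_a.copy(), scene_b)
print(fused)
```

A plain "first valid value wins" merge would lose a flag raised in the second scene; combining the bits keeps it, which is why a product-specific fuse function is passed to `dc.load`.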

In [34]:
#generate query object 
#need to rethink the items x and y for the coordinates 
query ={'resolution':(-30, 30),
        'align':(15,15),
        'group_by':'solar_day'}

We can convert the WOFL bit field into a binary array containing True and False values. This allows us to use the WOFL data as a mask that can be applied to other datasets.
The `make_mask` function allows us to create a mask using the flag labels (e.g. "wet" or "dry") rather than the underlying binary numbers. For more details on how to do masking on WOfS, see the [Applying WOfS bitmasking](../Frequently_used_code/Applying_WOfS_bitmasking.ipynb) notebook.

In [35]:
# Define a mask for no clear observations 
#no_clear = {"cloud_shadow":True, "cloud":True, "nodata":True}
# Define a mask for wet and clear pixels 
wet_nocloud = {"water_observed":True, "cloud_shadow":False, "cloud":False,"nodata":False}
# Define a mask for dry and clear pixels 
dry_nocloud = {"water_observed":False, "cloud_shadow":False, "cloud":False, "nodata":False}
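
The effect of these two flag dictionaries can be approximated with plain bitwise operations. The sketch below uses NumPy only; the bit values and the sample pixel values are assumptions made for illustration, and the real flag layout should be taken from the product definition.

```python
import numpy as np

# Assumed bit values, for illustration only
NODATA, CLOUD_SHADOW, CLOUD, WATER = 1, 32, 64, 128

# Hypothetical WOFL values for one point on five dates within a month
wofl_values = np.array([128, 0, 192, 1, 128], dtype=np.uint8)

# An observation is clear when none of these bits is set
unclear_bits = NODATA | CLOUD_SHADOW | CLOUD
is_clear = (wofl_values & unclear_bits) == 0
is_wet = (wofl_values & WATER) != 0

wet_clear = is_clear & is_wet    # counterpart of the wet_nocloud flags
dry_clear = is_clear & ~is_wet   # counterpart of the dry_nocloud flags

print(int(wet_clear.sum()), int(dry_clear.sum()))
```

Summing each boolean mask gives the number of clear wet and clear dry observations, which is how the counts are obtained for each validation point in the next cell.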

In [36]:
#%time  #in order to get the timing 

##input_data['CLASS_WET','CLASS_DRY'] = None

##Step 1: update the query for WOfS based on CEO input table 
for index, row in input_data.iterrows():
    #i = 0
    #print(" Feature {:04}/{:04}\r".format(i + 1, len(input_data)),end='')  #for monitoring performance  
    #get the month value for this validation point
    month = int(row['MONTH'])
    #set the time for the WOfS query according to the month value in the validation table
    #(named time_range to avoid shadowing the imported time module)
    time_range = '2018-' + f'{month:02d}'
    #copy the base query so that the original stays unchanged
    dc_query = deepcopy(query)
    #build the query geometry from the geometry of the current row
    geom = geometry.Geometry(row['geometry'].__geo_interface__, geometry.CRS('epsg:4326'))
    q = {"geopolygon":geom}
    t = {"time":time_range}
    dc_query.update(t)
    dc_query.update(q)
    wofls = dc.load(product ="ga_ls8c_wofs_2", output_crs = 'EPSG:6933', fuse_func=wofs_fuser,**dc_query).squeeze() 
    
    #freq = dc.load(product = "ga_ls8c_wofs_2_summary", output_crs = 'EPSG:6933', fuse_func=wofs_fuser,**dc_query).squeeze()
    #(freq.frequency*100).plot(figsize=(7,7))# wofls is a dataset here and you need to convert it to xarray
    #WOfS_freq = freq.frequency*100
    
##Step 2: update the table with count of clear observations that are either wet or dry in that particular month 

    #mask the wofls for wet and clear pixels 
    wofl_wetnocloud = masking.make_mask(wofls, **wet_nocloud).astype(int) #using astype(int) will convert the original bool value to integer 
    wofl_wet = wofl_wetnocloud.water.sum()#(dim='time') #sum will act as count here 
    #mask the wofls for dry and clear pixels 
    wofl_drynocloud = masking.make_mask(wofls, **dry_nocloud).astype(int)
    wofl_dry = wofl_drynocloud.water.sum()#(dim='time')
#     #mask the wofls for no clear pixels 
#     noclear = masking.make_mask(wofls, **no_clear).astype(int)
#     wofl_no_clear = noclear.water.sum()
    #update the column for the count of clear and wet observations
    input_data.at[index,'CLASS_WET'] = wofl_wet.values 
    #update the column for the count of clear and dry observations
    input_data.at[index,'CLASS_DRY'] = wofl_dry.values
    #total number of clear observations (wet or dry)
    wofl_clear_observations = wofl_dry + wofl_wet 
    #update the column for the count of clear observations
    input_data.at[index,'CLEAR_OBS'] = (wofl_clear_observations).values
    #update the column for the frequency of wet observations among the clear observations
    #(this is NaN when there are no clear observations)
    input_data.at[index,'FREQUENCY'] = (wofl_wet / wofl_clear_observations).values 
#     if index > 10: 
#         break 
#    i+=1
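
The per-point arithmetic in the loop above can be reproduced on a small table to see how months without clear observations behave. The counts below are hypothetical, and replacing a zero denominator with `NaN` is one explicit way to handle that case, not necessarily how the loop itself arrives at the missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical counts of clear wet and clear dry observations for four point-months
counts = pd.DataFrame({"CLASS_WET": [1, 0, 2, 0], "CLASS_DRY": [0, 0, 0, 1]})

counts["CLEAR_OBS"] = counts["CLASS_WET"] + counts["CLASS_DRY"]

# Replace zero denominators with NaN so that the frequency is explicitly missing
counts["FREQUENCY"] = counts["CLASS_WET"] / counts["CLEAR_OBS"].replace(0, np.nan)

print(counts)
```

A missing `FREQUENCY` therefore means "no clear observation in that month", which is different from a frequency of 0.0 (clear observations exist, none of them wet).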

In [37]:
input_data

Unnamed: 0,PLOT_ID,LON,LAT,FLAGGED,ANALYSES,SENTINEL2Y,WATER,NO_WATER,BAD_IMAGE,NOT_SURE,CLASS,COMMENT,MONTH,WATERFLAG,geometry,CLASS_WET,CLASS_DRY,CLEAR_OBS,FREQUENCY
0,137387037.0,29.875854,2.178788,0.0,1.0,2018,1-12,0,10,0,Open water - freshwater,Point is within the river channel,1,1,POINT (29.87585 2.17879),1.0,0.0,1.0,1.0
1,137387037.0,29.875854,2.178788,0.0,1.0,2018,1-12,0,10,0,Open water - freshwater,Point is within the river channel,2,1,POINT (29.87585 2.17879),1.0,0.0,1.0,1.0
2,137387037.0,29.875854,2.178788,0.0,1.0,2018,1-12,0,10,0,Open water - freshwater,Point is within the river channel,3,1,POINT (29.87585 2.17879),1.0,0.0,1.0,1.0
3,137387037.0,29.875854,2.178788,0.0,1.0,2018,1-12,0,10,0,Open water - freshwater,Point is within the river channel,4,1,POINT (29.87585 2.17879),0.0,0.0,0.0,
4,137387037.0,29.875854,2.178788,0.0,1.0,2018,1-12,0,10,0,Open water - freshwater,Point is within the river channel,5,1,POINT (29.87585 2.17879),2.0,0.0,2.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8672,137387436.0,34.410695,-9.881259,0.0,1.0,2018,"1,3,5-11",0,2412,0,Open water - freshwater,,10,1,POINT (34.41070 -9.88126),0.0,1.0,1.0,0.0
8673,137387436.0,34.410695,-9.881259,0.0,1.0,2018,"1,3,5-11",0,2412,0,Open water - freshwater,,11,1,POINT (34.41070 -9.88126),0.0,0.0,0.0,
8674,137387436.0,34.410695,-9.881259,0.0,1.0,2018,"1,3,5-11",0,2412,0,Open water - freshwater,,2,2,POINT (34.41070 -9.88126),1.0,0.0,1.0,1.0
8675,137387436.0,34.410695,-9.881259,0.0,1.0,2018,"1,3,5-11",0,2412,0,Open water - freshwater,,4,2,POINT (34.41070 -9.88126),1.0,0.0,1.0,1.0


In [38]:
input_data.to_csv(('../Supplementary_data/Validation/Refined/ground_truth_RCMRD2.csv'))

In [40]:
#clean the table based on two criteria:
#drop any row that has a water flag greater than 1 and also has zero clear observations
#(& requires both conditions; use | instead to drop rows that meet either one)
indexNames = input_data[(input_data['WATERFLAG'] > 1) & (input_data['CLEAR_OBS'] == 0)].index
input_data.drop(indexNames, inplace=True)
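
The choice between `&` and `|` changes which rows survive the filter. The following sketch shows the difference on a small hypothetical table; the column names match the notebook, the values do not come from the validation data.

```python
import pandas as pd

# Hypothetical rows covering all four combinations of the two criteria
points = pd.DataFrame({
    "WATERFLAG": [1, 1, 2, 2],
    "CLEAR_OBS": [1, 0, 1, 0],
})

flag_gt_1 = points["WATERFLAG"] > 1
no_clear = points["CLEAR_OBS"] == 0

kept_and = points[~(flag_gt_1 & no_clear)]  # drops rows meeting both criteria
kept_or = points[~(flag_gt_1 | no_clear)]   # drops rows meeting either criterion

print(len(kept_and), len(kept_or))
```

With `&`, rows that have no clear observations but a water flag of 0 or 1 stay in the table, so their missing `FREQUENCY` values remain.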

In [41]:
input_data

Unnamed: 0,PLOT_ID,LON,LAT,FLAGGED,ANALYSES,SENTINEL2Y,WATER,NO_WATER,BAD_IMAGE,NOT_SURE,CLASS,COMMENT,MONTH,WATERFLAG,geometry,CLASS_WET,CLASS_DRY,CLEAR_OBS,FREQUENCY
0,137387037.0,29.875854,2.178788,0.0,1.0,2018,1-12,0,10,0,Open water - freshwater,Point is within the river channel,1,1,POINT (29.87585 2.17879),1.0,0.0,1.0,1.0
1,137387037.0,29.875854,2.178788,0.0,1.0,2018,1-12,0,10,0,Open water - freshwater,Point is within the river channel,2,1,POINT (29.87585 2.17879),1.0,0.0,1.0,1.0
2,137387037.0,29.875854,2.178788,0.0,1.0,2018,1-12,0,10,0,Open water - freshwater,Point is within the river channel,3,1,POINT (29.87585 2.17879),1.0,0.0,1.0,1.0
3,137387037.0,29.875854,2.178788,0.0,1.0,2018,1-12,0,10,0,Open water - freshwater,Point is within the river channel,4,1,POINT (29.87585 2.17879),0.0,0.0,0.0,
4,137387037.0,29.875854,2.178788,0.0,1.0,2018,1-12,0,10,0,Open water - freshwater,Point is within the river channel,5,1,POINT (29.87585 2.17879),2.0,0.0,2.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8672,137387436.0,34.410695,-9.881259,0.0,1.0,2018,"1,3,5-11",0,2412,0,Open water - freshwater,,10,1,POINT (34.41070 -9.88126),0.0,1.0,1.0,0.0
8673,137387436.0,34.410695,-9.881259,0.0,1.0,2018,"1,3,5-11",0,2412,0,Open water - freshwater,,11,1,POINT (34.41070 -9.88126),0.0,0.0,0.0,
8674,137387436.0,34.410695,-9.881259,0.0,1.0,2018,"1,3,5-11",0,2412,0,Open water - freshwater,,2,2,POINT (34.41070 -9.88126),1.0,0.0,1.0,1.0
8675,137387436.0,34.410695,-9.881259,0.0,1.0,2018,"1,3,5-11",0,2412,0,Open water - freshwater,,4,2,POINT (34.41070 -9.88126),1.0,0.0,1.0,1.0


In [42]:
input_data.to_csv(('../Supplementary_data/Validation/Refined/ground_truth_RCMRD_filtered2.csv'))

In [43]:
#step3 : Figure out what to do with multiple time-steps in a month, 'mode?'
#step4 : reclassify into the same schema as the validation data (0,1,2 or 3)
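
One possible way to handle several time-steps in a month is a majority rule over the clear observations. The sketch below is an assumption made for illustration, not the rule adopted in this notebook: `monthly_label` is a hypothetical helper, and mapping its output onto the validation schema is left open.

```python
def monthly_label(n_wet, n_dry):
    """Hypothetical majority rule for one point and month.

    Returns 1 (water), 0 (no water), or None when there is no clear
    observation or the wet and dry counts are tied.
    """
    if n_wet == n_dry:
        return None
    return 1 if n_wet > n_dry else 0


# (clear wet count, clear dry count) for four hypothetical point-months
examples = [(2, 0), (0, 1), (1, 1), (0, 0)]
print([monthly_label(wet, dry) for wet, dry in examples])
```

Returning `None` for ties and empty months keeps undecidable cases out of the confusion matrix instead of forcing them into one class.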

In [None]:
#pred_tif = 'path to the table 
pred_tif = 'path to the table extracted above '
#Path to the AEZ region 
aez_region = '../Supplementary_data/Validation/simplified_AEZs/AEZs_ExcludeLargeWB_update.shp'
#areas = glob.glob('../../shapes/simplified_AEZs/*.shp') #using aez could be optional 

In [20]:
#you need to read two columns from this table:
# a. Water flag as the groundtruth(actual)
# b. Class Wet from WOfS as prediction 
#rename the column class to actual 
input_data = input_data.rename(columns={'WATERFLAG':'Actual','CLASS_WET':'Prediction'})
input_data.head()

Unnamed: 0,PLOT_ID,LON,LAT,FLAGGED,ANALYSES,SENTINEL2Y,WATER,NO_WATER,BAD_IMAGE,NOT_SURE,CLASS,COMMENT,MONTH,Actual,geometry,Prediction,CLASS_DRY,CLEAR_OBS,FREQUENCY
0,137387198.0,32.787356,1.24291,0.0,1.0,2018,"1,5,7-12",2,346,0,Wetlands - freshwater,none,2,0,POINT (32.78736 1.24291),1.0,1.0,2.0,0.5
1,137387198.0,32.787356,1.24291,0.0,1.0,2018,"1,5,7-12",2,346,0,Wetlands - freshwater,none,1,1,POINT (32.78736 1.24291),0.0,1.0,1.0,0.0
2,137387198.0,32.787356,1.24291,0.0,1.0,2018,"1,5,7-12",2,346,0,Wetlands - freshwater,none,5,1,POINT (32.78736 1.24291),0.0,0.0,0.0,
3,137387198.0,32.787356,1.24291,0.0,1.0,2018,"1,5,7-12",2,346,0,Wetlands - freshwater,none,7,1,POINT (32.78736 1.24291),0.0,1.0,1.0,0.0
4,137387198.0,32.787356,1.24291,0.0,1.0,2018,"1,5,7-12",2,346,0,Wetlands - freshwater,none,8,1,POINT (32.78736 1.24291),0.0,0.0,0.0,


In [274]:
#An optional stage, if there is a need to read the AEZ region file and clip the validation points 
aez=gpd.read_file(aez_region).to_crs('EPSG:6933')
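#reproject the validation points to the CRS of the AEZ regions
ground_truth = input_data.to_crs('EPSG:6933')
# clip validation points to region (optional)
ground_truth = gpd.overlay(ground_truth,aez,how='intersection')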

### Calculate confusion matrix 

In [22]:
confusion_matrix = pd.crosstab(input_data['Actual'],
                               input_data['Prediction'],
                               rownames=['Actual'],
                               colnames=['Prediction'],
                               margins=True)
confusion_matrix

Actual (rows) / Prediction (columns),0.0,1.0,2.0,3.0,4.0,All
0,39,2,0,0,0,41
1,150,169,64,8,2,393
2,20,33,11,0,0,64
All,209,204,75,8,2,498
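
Once both columns have been reduced to the same binary labels, overall, producer's and user's accuracy can be read off the cross-tabulation. The sketch below assumes such binary labels (1 = water, 0 = no water); the data is hypothetical and is not derived from the table above.

```python
import pandas as pd

# Hypothetical binary labels (1 = water, 0 = no water)
labels = pd.DataFrame({
    "Actual":     [1, 1, 1, 0, 0, 1, 0, 1],
    "Prediction": [1, 0, 1, 0, 1, 1, 0, 1],
})

# Rows are actual classes, columns are predicted classes
cm = pd.crosstab(labels["Actual"], labels["Prediction"])

# Overall accuracy: correctly classified points / all points
correct = sum(cm.loc[c, c] for c in cm.index)
overall_accuracy = correct / cm.to_numpy().sum()

# Producer's accuracy: correct / row total (actual points of that class)
producers_accuracy = {c: cm.loc[c, c] / cm.loc[c].sum() for c in cm.index}

# User's accuracy: correct / column total (points predicted as that class)
users_accuracy = {c: cm.loc[c, c] / cm[c].sum() for c in cm.columns}

print(overall_accuracy, producers_accuracy, users_accuracy)
```

This form only works when every class appears in both the actual and predicted columns; otherwise the cross-tabulation is not square and the diagonal lookup fails.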


In [2]:
print(datacube.__version__)

1.8.2.dev7+gdcab0e02


***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

**Last modified:** January 2020

**Compatible datacube version:** 

## Tags
Browse all available tags on the DE Africa User Guide's [Tags Index](https://) (placeholder as this does not exist yet)

In [None]:
#test the groundtruth with a 6933 EPSG as well (conversion) - how to reproject