# WOfS Validation Accuracy Assessment <img align="right" src="../Supplementary_data/DE_Africa_Logo_Stacked_RGB_small.jpg">

* **Products used:** 
[ga_ls8c_wofs_2](https://explorer.digitalearth.africa/ga_ls8c_wofs_2),
[ga_ls8c_wofs_2_summary](https://explorer.digitalearth.africa/ga_ls8c_wofs_2_summary)

## Background
Accuracy assessment of the WOfS product in Africa involves generating a confusion (error) matrix for the binary WOFL classification.
The inputs for estimating the accuracy of the WOfS-derived product are a binary WOFL classification layer showing water/non-water and a shapefile containing validation points collected with the [Collect Earth Online](https://collect.earth/) tool. The validation points are treated as the ground truth (actual) data, while the value extracted from the WOFL at each location is the predicted value. The output of this analysis is a confusion matrix reporting overall, producer's and user's accuracy.

## Description
This notebook explains how to assess the accuracy of the WOfS-derived product using a collected ground truth dataset.

The notebook demonstrates how to:
1. Generate a confusion matrix for the WOFL binary classification
2. Assess the accuracy of the classification
***

## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

After finishing the analysis, you can change the input file path in the "Loading Dataset" cell and re-run the analysis for a different set of validation points.

### Load packages
Import Python packages that are used for the analysis.

In [84]:
%matplotlib inline

import time 
import datacube
from datacube.utils import masking, geometry 
import sys
import os
import dask 
import rasterio, rasterio.features
import xarray
import glob
import numpy as np
import pandas as pd
import seaborn as sn
import geopandas as gpd
import subprocess as sp
import matplotlib.pyplot as plt
import scipy, scipy.ndimage
import warnings
warnings.filterwarnings("ignore") #this will suppress the warnings for multiple UTM zones in your AOI 

sys.path.append("../Scripts")
from rasterio.mask import mask
from geopandas import GeoSeries, GeoDataFrame
from shapely.geometry import Point
from sklearn.metrics import confusion_matrix, accuracy_score 
from sklearn.metrics import plot_confusion_matrix, f1_score  
from deafrica_plotting import map_shapefile,display_map, rgb
from deafrica_spatialtools import xr_rasterize
from deafrica_datahandling import wofs_fuser, mostcommon_crs,load_ard,deepcopy
from deafrica_dask import create_local_dask_cluster

### Loading Dataset

Read in the validation data CSV, clean the table and rename the columns that hold the actual and predicted values.

We need to read two columns from this table:
- `WATERFLAG`: the water flag, used as the ground truth (actual)
- `CLASS_WET`: the wet class from WOfS (prediction)

In [113]:
#Read the ground truth data following analysis step 
#CEO = '../Supplementary_data/Validation/Refined/NewAnalysis/Continent/WOfS_processed/Intitutions/Point_Based/AEZs/ValidationPoints_Southern.csv'
CEO = '../Supplementary_data/Validation/Refined/NewAnalysis/Continent/WOfS_processed/Intitutions/Point_Based/Africa_ValidationPoints.csv'

df = pd.read_csv(CEO,delimiter=",")

In [114]:
df.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'PLOT_ID', 'LON', 'LAT', 'FLAGGED',
       'ANALYSES', 'SENTINEL2Y', 'STARTDATE', 'ENDDATE', 'WATER', 'NO_WATER',
       'BAD_IMAGE', 'NOT_SURE', 'CLASS', 'COMMENT', 'MONTH', 'WATERFLAG',
       'geometry', 'CLASS_WET', 'CLEAR_OBS'],
      dtype='object')

In [115]:
input_data = df.drop(['Unnamed: 0', 'Unnamed: 0.1','FLAGGED', 'ANALYSES','SENTINEL2Y','STARTDATE', 'ENDDATE', 'WATER', 'NO_WATER', 'BAD_IMAGE', 'NOT_SURE','COMMENT','geometry'], axis=1)
input_data = input_data.rename(columns={'WATERFLAG':'ACTUAL'})

In [116]:
input_data

| | PLOT_ID | LON | LAT | CLASS | MONTH | ACTUAL | CLASS_WET | CLEAR_OBS |
|---|---|---|---|---|---|---|---|---|
| 0 | 137483175.0 | 30.463813 | -26.653807 | Open water - freshwater | 1 | 1 | 1.0 | 2.0 |
| 1 | 137483175.0 | 30.463813 | -26.653807 | Open water - freshwater | 2 | 1 | 0.0 | 0.0 |
| 2 | 137483175.0 | 30.463813 | -26.653807 | Open water - freshwater | 2 | 2 | 0.0 | 0.0 |
| 3 | 137483175.0 | 30.463813 | -26.653807 | Open water - freshwater | 3 | 1 | NaN | NaN |
| 4 | 137483175.0 | 30.463813 | -26.653807 | Open water - freshwater | 4 | 1 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40126 | 137712355.0 | 6.246484 | 4.329523 | Open water - marine | 5 | 2 | NaN | NaN |
| 40127 | 137712355.0 | 6.246484 | 4.329523 | Open water - marine | 6 | 2 | NaN | NaN |
| 40128 | 137712355.0 | 6.246484 | 4.329523 | Open water - marine | 7 | 2 | NaN | NaN |
| 40129 | 137712355.0 | 6.246484 | 4.329523 | Open water - marine | 9 | 2 | 0.0 | 0.0 |


In [117]:
countpoints = input_data.groupby('PLOT_ID',as_index=False,sort=False).last()
countpoints

| | PLOT_ID | LON | LAT | CLASS | MONTH | ACTUAL | CLASS_WET | CLEAR_OBS |
|---|---|---|---|---|---|---|---|---|
| 0 | 137483175.0 | 30.463813 | -26.653807 | Open water - freshwater | 12 | 1 | 1.0 | 1.0 |
| 1 | 137483176.0 | 30.026031 | -26.673227 | Open water - Constructed (e.g. aquaculture) | 12 | 1 | 0.0 | 2.0 |
| 2 | 137483177.0 | 31.700362 | -26.746737 | Open water - freshwater | 12 | 1 | 0.0 | 0.0 |
| 3 | 137483178.0 | 31.937287 | -26.801901 | Open water - freshwater | 12 | 2 | 0.0 | 0.0 |
| 4 | 137483179.0 | 27.339949 | -26.863925 | Open water - freshwater | 12 | 1 | 1.0 | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2895 | 137712351.0 | -7.552680 | 4.444137 | Open water - marine | 12 | 2 | 0.0 | 0.0 |
| 2896 | 137712352.0 | 6.026038 | 4.435646 | Open water - marine | 11 | 2 | 0.0 | 0.0 |
| 2897 | 137712353.0 | 5.840416 | 4.426212 | Open water - marine | 12 | 3 | 1.0 | 1.0 |
| 2898 | 137712354.0 | 6.631720 | 4.347916 | Open water - marine | 10 | 2 | 0.0 | 0.0 |


In [118]:
#setting the column prediction based on frequency flag or using class_wet flag 
#input_data['PREDICTION'] = input_data['FREQUENCY'].apply(lambda x: '1' if x > 0.5 else '0')
input_data['PREDICTION'] = input_data['CLASS_WET'].apply(lambda x: '1' if x >=1 else '0')  
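
The rule in the cell above also decides what happens to rows with a missing `CLASS_WET`. A minimal sketch on toy values (not taken from the validation table) suggests how this plays out: any comparison with `NaN` is false, so missing observations receive the label `'0'`.

```python
import numpy as np
import pandas as pd

# Toy CLASS_WET values (illustrative only): wet, dry, missing, wet
class_wet = pd.Series([1.0, 0.0, np.nan, 2.0])

# Same rule as the notebook cell: NaN >= 1 evaluates to False,
# so a missing observation is labelled '0' (non-water)
prediction = class_wet.apply(lambda x: '1' if x >= 1 else '0')

print(prediction.tolist())
```

Rows labelled this way are not true non-water predictions, which is one reason the later filter on `CLEAR_OBS` matters.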

Remove duplicated records, i.e. points that were labelled more than once for the same month (as 0, 1, 2 or 3). All copies of such records are dropped.

In [119]:
Duplicate = input_data.duplicated(['LAT', 'LON','MONTH'], keep=False)
input_data = input_data[Duplicate==False]
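
As a small illustration of the duplicate filter, the sketch below uses a hypothetical four-row table. With `keep=False`, `duplicated` flags every member of a repeated `(LAT, LON, MONTH)` group, so conflicting labels are removed entirely rather than reduced to one copy.

```python
import pandas as pd

# Hypothetical labels: the first point has two conflicting records for month 2
toy = pd.DataFrame({
    'LAT':    [-26.65, -26.65, -26.65, 4.33],
    'LON':    [30.46, 30.46, 30.46, 6.25],
    'MONTH':  [1, 2, 2, 2],
    'ACTUAL': [1, 1, 2, 2],
})

# keep=False marks all rows of a duplicated group, not only the repeats
is_duplicate = toy.duplicated(['LAT', 'LON', 'MONTH'], keep=False)
unique_rows = toy[~is_duplicate]

print(unique_rows.index.tolist())
```

Only the rows at index 0 and 3 survive; both month-2 records of the first point are discarded.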

In [120]:
input_data

| | PLOT_ID | LON | LAT | CLASS | MONTH | ACTUAL | CLASS_WET | CLEAR_OBS | PREDICTION |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 137483175.0 | 30.463813 | -26.653807 | Open water - freshwater | 1 | 1 | 1.0 | 2.0 | 1 |
| 3 | 137483175.0 | 30.463813 | -26.653807 | Open water - freshwater | 3 | 1 | NaN | NaN | 0 |
| 4 | 137483175.0 | 30.463813 | -26.653807 | Open water - freshwater | 4 | 1 | NaN | NaN | 0 |
| 5 | 137483175.0 | 30.463813 | -26.653807 | Open water - freshwater | 5 | 1 | 1.0 | 1.0 | 1 |
| 6 | 137483175.0 | 30.463813 | -26.653807 | Open water - freshwater | 6 | 1 | 1.0 | 3.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40126 | 137712355.0 | 6.246484 | 4.329523 | Open water - marine | 5 | 2 | NaN | NaN | 0 |
| 40127 | 137712355.0 | 6.246484 | 4.329523 | Open water - marine | 6 | 2 | NaN | NaN | 0 |
| 40128 | 137712355.0 | 6.246484 | 4.329523 | Open water - marine | 7 | 2 | NaN | NaN | 0 |
| 40129 | 137712355.0 | 6.246484 | 4.329523 | Open water - marine | 9 | 2 | 0.0 | 0.0 | 0 |


In [121]:
# count the plots that remain after removing duplicates (last record per PLOT_ID)
count21 = input_data.groupby('PLOT_ID',as_index=False,sort=False).last()
count21

| | PLOT_ID | LON | LAT | CLASS | MONTH | ACTUAL | CLASS_WET | CLEAR_OBS | PREDICTION |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 137483175.0 | 30.463813 | -26.653807 | Open water - freshwater | 12 | 1 | 1.0 | 1.0 | 1 |
| 1 | 137483176.0 | 30.026031 | -26.673227 | Open water - Constructed (e.g. aquaculture) | 12 | 1 | 0.0 | 2.0 | 0 |
| 2 | 137483177.0 | 31.700362 | -26.746737 | Open water - freshwater | 12 | 1 | 0.0 | 0.0 | 0 |
| 3 | 137483178.0 | 31.937287 | -26.801901 | Open water - freshwater | 11 | 1 | 0.0 | 0.0 | 0 |
| 4 | 137483179.0 | 27.339949 | -26.863925 | Open water - freshwater | 12 | 1 | 1.0 | 3.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2488 | 137712351.0 | -7.552680 | 4.444137 | Open water - marine | 12 | 2 | 0.0 | 0.0 | 0 |
| 2489 | 137712352.0 | 6.026038 | 4.435646 | Open water - marine | 11 | 2 | 0.0 | 0.0 | 0 |
| 2490 | 137712353.0 | 5.840416 | 4.426212 | Open water - marine | 12 | 3 | 1.0 | 1.0 | 1 |
| 2491 | 137712354.0 | 6.631720 | 4.347916 | Open water - marine | 10 | 2 | 0.0 | 0.0 | 0 |


The following cell filters out rows whose label is greater than 1 or that have no clear WOfS observations.

In [122]:
indexNames = input_data[(input_data['ACTUAL'] > 1) | (input_data['CLEAR_OBS']==0.0) | (input_data['CLEAR_OBS'].isna())].index
input_data.drop(indexNames, inplace=True)
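
The same filter can be written from the "keep" side, which some readers find easier to check. The sketch below uses a hypothetical five-row table and assumes `CLEAR_OBS` is never negative. Under that assumption, `CLEAR_OBS > 0` excludes both zero and `NaN` in a single test.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering each case handled by the filter
toy = pd.DataFrame({
    'ACTUAL':    [1, 0, 2, 1, 0],
    'CLEAR_OBS': [2.0, 1.0, 3.0, 0.0, np.nan],
})

# Keep rows labelled 0 or 1 with at least one clear observation;
# NaN > 0 is False, so rows with missing CLEAR_OBS are dropped too
valid = toy[(toy['ACTUAL'] <= 1) & (toy['CLEAR_OBS'] > 0)]

print(valid.index.tolist())
```

Building a new frame this way also avoids modifying `input_data` in place, which can make re-running individual cells less error-prone.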

In [123]:
count22 = input_data.groupby('PLOT_ID',as_index=False,sort=False).last()
count22

| | PLOT_ID | LON | LAT | CLASS | MONTH | ACTUAL | CLASS_WET | CLEAR_OBS | PREDICTION |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 137483175.0 | 30.463813 | -26.653807 | Open water - freshwater | 12 | 1 | 1.0 | 1.0 | 1 |
| 1 | 137483176.0 | 30.026031 | -26.673227 | Open water - Constructed (e.g. aquaculture) | 12 | 1 | 0.0 | 2.0 | 0 |
| 2 | 137483177.0 | 31.700362 | -26.746737 | Open water - freshwater | 10 | 1 | 1.0 | 2.0 | 1 |
| 3 | 137483178.0 | 31.937287 | -26.801901 | Open water - freshwater | 10 | 1 | 0.0 | 2.0 | 0 |
| 4 | 137483179.0 | 27.339949 | -26.863925 | Open water - freshwater | 12 | 1 | 1.0 | 3.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2279 | 137712351.0 | -7.552680 | 4.444137 | Open water - marine | 2 | 1 | 1.0 | 1.0 | 1 |
| 2280 | 137712352.0 | 6.026038 | 4.435646 | Open water - marine | 12 | 1 | 1.0 | 1.0 | 1 |
| 2281 | 137712353.0 | 5.840416 | 4.426212 | Open water - marine | 1 | 1 | 1.0 | 1.0 | 1 |
| 2282 | 137712354.0 | 6.631720 | 4.347916 | Open water - marine | 3 | 1 | 1.0 | 1.0 | 1 |


To save the table of valid points, run the following cell. Otherwise, skip to the next cell.

In [175]:
input_data.to_csv(('../Supplementary_data/Validation/Refined/NewAnalysis/Continent/WOfS_processed/Intitutions/Point_Based/AEZs/ValidPoints/Africa_ValidationPoints.csv'))

In [124]:
confusion_matrix = pd.crosstab(input_data['ACTUAL'],input_data['PREDICTION'],rownames=['ACTUAL'],colnames=['PREDICTION'],margins=True)
confusion_matrix

| ACTUAL / PREDICTION | 0 | 1 | All |
|---|---|---|---|
| 0 | 3279 | 432 | 3711 |
| 1 | 1696 | 6741 | 8437 |
| All | 4975 | 7173 | 12148 |
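
To make the layout of the cross-tabulation explicit, here is a minimal sketch on five made-up label pairs. Rows correspond to the actual class, columns to the predicted class, and `margins=True` adds the `All` totals.

```python
import pandas as pd

# Made-up labels for illustration only
actual = pd.Series([0, 0, 1, 1, 1], name='ACTUAL')
predicted = pd.Series([0, 1, 1, 1, 0], name='PREDICTION')

# Rows are actual classes, columns are predicted classes
table = pd.crosstab(actual, predicted, margins=True)

print(table)
```

Reading across the row for actual class 1: one site was predicted as 0 and two as 1, giving a row total of 3.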


`Producer's Accuracy` is the map-maker's accuracy: the probability that a given class on the ground is correctly classified on the map. It is the complement of the omission error (producer's accuracy = 100% - omission error).

In [125]:
confusion_matrix["Producer's"] = [confusion_matrix.loc[0][0] / confusion_matrix.loc[0]['All'] * 100, confusion_matrix.loc[1][1] / confusion_matrix.loc[1]['All'] *100, np.nan]
confusion_matrix

| ACTUAL / PREDICTION | 0 | 1 | All | Producer's |
|---|---|---|---|---|
| 0 | 3279 | 432 | 3711 | 88.358933 |
| 1 | 1696 | 6741 | 8437 | 79.898068 |
| All | 4975 | 7173 | 12148 | NaN |
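
The producer's accuracy values can be cross-checked directly from the counts in the table. The sketch below is an independent NumPy check, not the notebook's own method: it divides each diagonal entry by its row total.

```python
import numpy as np

# Counts from the confusion matrix: rows = actual, columns = predicted
# (order: non-water, water)
counts = np.array([[3279, 432],
                   [1696, 6741]])

# Correctly classified sites per class divided by the reference (row) totals
producers = np.diag(counts) / counts.sum(axis=1) * 100

print(producers.round(2))
```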


`User's Accuracy` is the map-user's accuracy: how often a class shown on the map is actually present on the ground, i.e. the reliability of the map. It is calculated as the number of correctly classified sites for a class divided by the total number of sites classified as that class.

In [126]:
#For continent 
users_accuracy = pd.Series([confusion_matrix['0'][0] / confusion_matrix['0']['All'] * 100,
                                confusion_matrix['1'][1] / confusion_matrix['1']['All'] * 100]).rename("User's")

confusion_matrix = confusion_matrix.append(users_accuracy)
confusion_matrix 

| ACTUAL | 0 | 1 | All | Producer's | 0 | 1 |
|---|---|---|---|---|---|---|
| 0 | 3279.0 | 432.0 | 3711.0 | 88.358933 | NaN | NaN |
| 1 | 1696.0 | 6741.0 | 8437.0 | 79.898068 | NaN | NaN |
| All | 4975.0 | 7173.0 | 12148.0 | NaN | NaN | NaN |
| User's | NaN | NaN | NaN | NaN | 65.909548 | 93.977415 |
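
User's accuracy is the column-wise counterpart of producer's accuracy. As a rough cross-check using the same counts, each diagonal entry is divided by its column total, i.e. by the number of sites the map assigned to that class.

```python
import numpy as np

# Counts from the confusion matrix: rows = actual, columns = predicted
counts = np.array([[3279, 432],
                   [1696, 6741]])

# Correctly classified sites per class divided by the mapped (column) totals
users = np.diag(counts) / counts.sum(axis=0) * 100

print(users.round(2))
```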


`Overall Accuracy` is the proportion of all reference (actual) sites that were mapped correctly. It is stored in the cell where the `User's` row meets the `Producer's` column.

In [127]:
confusion_matrix.loc["User's", "Producer's"] = (confusion_matrix['0'][0] + confusion_matrix['1'][1]) / confusion_matrix['All']['All'] * 100
confusion_matrix

| ACTUAL | 0 | 1 | All | Producer's | 0 | 1 |
|---|---|---|---|---|---|---|
| 0 | 3279.0 | 432.0 | 3711.0 | 88.358933 | NaN | NaN |
| 1 | 1696.0 | 6741.0 | 8437.0 | 79.898068 | NaN | NaN |
| All | 4975.0 | 7173.0 | 12148.0 | NaN | NaN | NaN |
| User's | NaN | NaN | NaN | 82.482713 | 65.909548 | 93.977415 |
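
Overall accuracy reduces to the sum of the diagonal divided by the total number of sites. A short check on the same counts:

```python
import numpy as np

# Counts from the confusion matrix: rows = actual, columns = predicted
counts = np.array([[3279, 432],
                   [1696, 6741]])

# Sites on the diagonal are the ones mapped correctly
overall = np.trace(counts) / counts.sum() * 100

print(round(overall, 2))
```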


In [128]:
input_data['PREDICTION'] = input_data['PREDICTION'].astype(str).astype(int)

The F1 score is the harmonic mean of precision and recall (here, user's and producer's accuracy). It reaches its best value at 1 (perfect precision and recall) and is calculated as:

In [129]:
fscore = pd.Series([(2*(confusion_matrix.loc["User's"][0]*confusion_matrix.loc[0]["Producer's"]) / (confusion_matrix.loc["User's"][0] + confusion_matrix.loc[0]["Producer's"])) / 100,
                   f1_score(input_data['ACTUAL'],input_data['PREDICTION'])]).rename("F-score")
confusion_matrix = confusion_matrix.append(fscore)

In [130]:
confusion_matrix

| ACTUAL | 0 | 1 | All | Producer's | 0 | 1 |
|---|---|---|---|---|---|---|
| 0 | 3279.0 | 432.0 | 3711.0 | 88.358933 | NaN | NaN |
| 1 | 1696.0 | 6741.0 | 8437.0 | 79.898068 | NaN | NaN |
| All | 4975.0 | 7173.0 | 12148.0 | NaN | NaN | NaN |
| User's | NaN | NaN | NaN | 82.482713 | 65.909548 | 93.977415 |
| F-score | NaN | NaN | NaN | NaN | 0.755008 | 0.863677 |
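
Both F-scores in the table can be reproduced from the user's and producer's accuracies alone. The helper below, `f1_from_accuracies`, is a hypothetical name introduced for this sketch. It treats user's accuracy as precision and producer's accuracy as recall, both given in percent.

```python
def f1_from_accuracies(users, producers):
    """Harmonic mean of user's (precision) and producer's (recall) accuracy.

    Both inputs are percentages; the result is on a 0-1 scale.
    """
    return 2 * users * producers / (users + producers) / 100


# Accuracy values taken from the table above
f1_nowater = f1_from_accuracies(65.909548, 88.358933)
f1_water = f1_from_accuracies(93.977415, 79.898068)

print(round(f1_nowater, 4), round(f1_water, 4))
```

The water value agrees with the one the notebook obtains from `f1_score`, which suggests the two routes are consistent for this table.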


In [131]:
confusion_matrix = confusion_matrix.round(decimals=2)

In [132]:
confusion_matrix = confusion_matrix.rename(columns={'0':'NoWater','1':'Water', 0:'NoWater',1:'Water','All':'Total'},index={'0':'NoWater','1':'Water',0:'NoWater',1:'Water','All':'Total'})

In [133]:
confusion_matrix

| ACTUAL | NoWater | Water | Total | Producer's | NoWater | Water |
|---|---|---|---|---|---|---|
| NoWater | 3279.0 | 432.0 | 3711.0 | 88.36 | NaN | NaN |
| Water | 1696.0 | 6741.0 | 8437.0 | 79.9 | NaN | NaN |
| Total | 4975.0 | 7173.0 | 12148.0 | NaN | NaN | NaN |
| User's | NaN | NaN | NaN | 82.48 | 65.91 | 93.98 |
| F-score | NaN | NaN | NaN | NaN | 0.76 | 0.86 |


In [134]:
confusion_matrix.to_csv('../Supplementary_data/Validation/Refined/NewAnalysis/Continent/WOfS_processed/Intitutions/Point_Based/Africa_confusion_matrix.csv')

In [2]:
print(datacube.__version__)

1.8.2.dev7+gdcab0e02


***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [Github](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

**Last modified:** January 2020

**Compatible datacube version:** 

## Tags
Browse all available tags on the DE Africa User Guide's [Tags Index](https://) (placeholder as this does not exist yet)