# WOfS Validation_Data-Clean  <img align="right" src="../Supplementary_data/DE_Africa_Logo_Stacked_RGB_small.jpg">

* **Products used:** 
[ga_ls8c_wofs_2](https://explorer.digitalearth.africa/ga_ls8c_wofs_2),
[ga_ls8c_wofs_2_summary](https://explorer.digitalearth.africa/ga_ls8c_wofs_2_summary)

## Background
[Water Observations from Space (WOfS)](https://www.ga.gov.au/scientific-topics/community-safety/flood/wofs/about-wofs) is a product derived from Landsat 8 satellite observations, as part of the provisional Landsat 8 Collection 2 surface reflectance, that shows surface water detected in Africa.
Individual water classified images are called Water Observation Feature Layers (WOFLs), and are created in a 1-to-1 relationship with the input satellite data. 
Hence there is one WOFL for each satellite dataset processed for the occurrence of water.

The data in a WOFL are stored as a bit field. This is a binary number in which each digit is independently set or not, based on the presence (1) or absence (0) of a particular attribute (water, cloud, cloud shadow, etc.). In this way, the single decimal value associated with each pixel can provide information on a variety of features of that pixel.
For more information on the structure of WOFLs and how to interact with them, see the [Water Observations from Space](../Datasets/Water_Observations_from_Space.ipynb) and [Applying WOfS bitmasking](../Frequently_used_code/Applying_WOfS_bitmasking.ipynb) notebooks.
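
As a minimal sketch of how a bit field packs several yes/no attributes into one number, the snippet below decodes a pixel value into flag names. The flag names and bit values in `EXAMPLE_FLAGS` are assumptions chosen for illustration; they are not the official WOfS bit definitions, which are described in the notebooks linked above.

```python
# Illustrative flag layout only: these names and bit values are assumptions
# made for this example, not the official WOfS bit definitions.
EXAMPLE_FLAGS = {
    "cloud_shadow": 0b00100000,  # 32
    "cloud": 0b01000000,         # 64
    "water": 0b10000000,         # 128
}


def decode_flags(value, flags=EXAMPLE_FLAGS):
    """Return the names of every flag whose bit is set in `value`."""
    return [name for name, bit in flags.items() if value & bit]


print(decode_flags(128))  # only the "water" bit is set
print(decode_flags(192))  # 128 + 64: the "cloud" and "water" bits are set
```

The bitwise AND (`value & bit`) is non-zero only when that particular bit is set, which is why a single integer per pixel is enough to carry several independent attributes.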

## Description
This notebook explains how to compile the tables exported from the Collect Earth Online tool by each partner institution and make them analysis-ready for WOfS analysis and accuracy assessment.

The notebook demonstrates how to:

1. Load the collected validation points as a list of observations, each with a location and month
2. Wrangle the data, including cleaning the table and mapping each point to twelve monthly observations

***

## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

After finishing the analysis, you can modify the file path in the "Analysis parameters" cell and re-run the notebook to clean the validation table from a different partner institution.

### Load packages

In [123]:
%matplotlib inline

import datacube
from datacube.utils import masking, geometry 
import sys
import os
import dask 
import rasterio, rasterio.features
import xarray
import glob
import numpy as np
import pandas as pd
import seaborn as sn
import geopandas as gpd
import subprocess as sp
import matplotlib.pyplot as plt
import scipy, scipy.ndimage
import warnings
warnings.filterwarnings("ignore") #this will suppress the warnings for multiple UTM zones in your AOI 

sys.path.append("../Scripts")
from deafrica_plotting import display_map, rgb
from deafrica_spatialtools import xr_rasterize
from deafrica_datahandling import wofs_fuser, mostcommon_crs,load_ard
from rasterio.mask import mask

### Connect to the datacube

In [124]:
dc = datacube.Datacube()

### Analysis parameters

In [125]:
#Make sure the validation points have at least three columns: location (x, y), class, and the 12 monthly records for each observation
#Path to the validation data points csv file 
CEO = '../Supplementary_data/Validation/CEO_1_AGRI_2020-10-28.csv'

### Loading Dataset

In [126]:
#Read in the validation data csv
df = pd.read_csv(CEO, delimiter=",")
df.columns

Index(['PLOT_ID', 'SAMPLE_ID', 'LON', 'LAT', 'FLAGGED', 'ANALYSES', 'USER_ID',
       'COLLECTION_TIME', 'ANALYSIS_DURATION', 'IMAGERY_TITLE',
       'GEEIMAGECOLLECTIONASSETID', 'GEEIMAGECOLLECTIONENDDATE',
       'GEEIMAGECOLLECTIONSTARTDATE', 'PL_PLOTID',
       'ENTER MONTHS[1-12] IN 2018, WATER WAS OBSERVED?',
       'ENTER MONTHS[1-12] IN 2018, WATER WAS NOT OBSERVED?',
       'ENTER MONTHS[1-12] IN 2018, IMAGE WAS BAD?',
       'ENTER MONTHS[1-12] IN 2018, THAT YOU ARE UNSURE IF YOU OBSERVE WATER OR NOT? ',
       'WHAT IS THE FEATURE?', 'COMMENT'],
      dtype='object')

In [127]:
ground_truth = df.drop(['SAMPLE_ID','USER_ID','IMAGERY_TITLE','COLLECTION_TIME','ANALYSIS_DURATION','GEEIMAGECOLLECTIONASSETID','PL_PLOTID'], axis=1)

In [128]:
ground_truth.columns

Index(['PLOT_ID', 'LON', 'LAT', 'FLAGGED', 'ANALYSES',
       'GEEIMAGECOLLECTIONENDDATE', 'GEEIMAGECOLLECTIONSTARTDATE',
       'ENTER MONTHS[1-12] IN 2018, WATER WAS OBSERVED?',
       'ENTER MONTHS[1-12] IN 2018, WATER WAS NOT OBSERVED?',
       'ENTER MONTHS[1-12] IN 2018, IMAGE WAS BAD?',
       'ENTER MONTHS[1-12] IN 2018, THAT YOU ARE UNSURE IF YOU OBSERVE WATER OR NOT? ',
       'WHAT IS THE FEATURE?', 'COMMENT'],
      dtype='object')

In [129]:
ground_truth.shape

(725, 13)

In [130]:
ground_truth = ground_truth.rename(columns={'WHAT IS THE FEATURE?':'CLASS','ENTER MONTHS[1-12] IN 2018, WATER WAS OBSERVED?':'WATER','SENTINEL2MOSAICYEARMONTH':'S2DATE',
                                            'ENTER MONTHS[1-12] IN 2018, WATER WAS NOT OBSERVED?':'NO_WATER','ENTER MONTHS[1-12] IN 2018, IMAGE WAS BAD?':'BAD_IMAGE',
                                             'ENTER MONTHS[1-12] IN 2018, THAT YOU ARE UNSURE IF YOU OBSERVE WATER OR NOT? ':'NOT_SURE',
                                            'GEEIMAGECOLLECTIONENDDATE':'ENDDATE','GEEIMAGECOLLECTIONSTARTDATE':'STARTDATE'})
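
Because `rename` silently ignores keys that do not match any column, it can help to confirm that the renamed table actually contains the columns the later cells read. The sketch below shows one way to do that on a toy table; the `REQUIRED_COLUMNS` list and the `missing_columns` helper are assumptions made for this example, not part of the original workflow.

```python
import pandas as pd

# Columns the later cleaning steps read. This list is an assumption drawn
# from the cells in this notebook, not a formal schema.
REQUIRED_COLUMNS = ["PLOT_ID", "LON", "LAT", "WATER", "NO_WATER",
                    "BAD_IMAGE", "NOT_SURE", "CLASS"]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [name for name in required if name not in frame.columns]


# Toy table that lacks two of the expected columns.
toy = pd.DataFrame(columns=["PLOT_ID", "LON", "LAT", "WATER", "NO_WATER", "CLASS"])
print(missing_columns(toy))
```

An empty list means every expected column is present, so the cleaning cells can run without a `KeyError`.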

In [131]:
ground_truth

| | PLOT_ID | LON | LAT | FLAGGED | ANALYSES | ENDDATE | STARTDATE | WATER | NO_WATER | BAD_IMAGE | NOT_SURE | CLASS | COMMENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 137711631 | 17.782114 | 7.802986 | False | 1 | 5/12/2018 | 1/12/2018 | 1235612 | 810 | 47912 | 0 | Open water - freshwater | |
| 1 | 137711632 | 17.982660 | 7.455957 | False | 1 | 5/06/2018 | 1/06/2018 | 1261112 | 0 | 478910 | 35 | Open water - freshwater | |
| 2 | 137711633 | 24.357867 | 6.961847 | False | 1 | 5/10/2018 | 1/10/2018 | 123 | 1012 | 45678911 | 0 | Open water - freshwater | |
| 3 | 137711634 | 12.709994 | 6.525273 | False | 1 | 5/10/2018 | 1/10/2018 | 1410 | 0 | 23567891112 | 0 | Open water - freshwater | |
| 4 | 137711635 | 17.091860 | 6.464220 | False | 1 | 5/01/2018 | 1/01/2018 | 1 | 0 | 2-Dec | 0 | Open water - freshwater | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 720 | 137712351 | -7.552680 | 4.444137 | False | 1 | 5/12/2018 | 1/12/2018 | 12 | 0 | 3-Dec | 0 | Open water - marine | |
| 721 | 137712352 | 6.026038 | 4.435646 | False | 1 | 5/12/2018 | 1/12/2018 | 12312 | 0 | 4-Nov | 0 | Open water - marine | |
| 722 | 137712353 | 5.840416 | 4.426212 | False | 1 | 5/12/2018 | 1/12/2018 | 12311 | 0 | 4-Oct | 12 | Open water - marine | |
| 723 | 137712354 | 6.631720 | 4.347916 | False | 1 | 5/10/2018 | 1/10/2018 | 12381112 | 0 | 4567910 | 0 | Open water - marine | |


In [132]:
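#Convert the month columns to strings if they are not already, so the .str methods below work on all of them
ground_truth['NOT_SURE'] = ground_truth.NOT_SURE.astype(str)
ground_truth['WATER'] = ground_truth.WATER.astype(str)
ground_truth['NO_WATER'] = ground_truth.NO_WATER.astype(str)
ground_truth['BAD_IMAGE'] = ground_truth.BAD_IMAGE.astype(str)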

In [133]:
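#Strip stray brackets and ampersands from the month columns
#regex=False treats '[' and ']' as literal characters rather than regular expression syntax
cols = ['WATER','NO_WATER','BAD_IMAGE','NOT_SURE']
for col in cols:
    ground_truth[col] = ground_truth[col].str.replace('[','', regex=False)
    ground_truth[col] = ground_truth[col].str.replace(']','', regex=False)
    ground_truth[col] = ground_truth[col].str.replace('&','', regex=False)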
    #ground_truth[col] = [''.join(c.split()) for c in ground_truth[col]]
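
The three literal replacements can also be expressed as one regular expression with a character class. The sketch below shows the idea on a toy series; the sample values are made up for illustration and the pattern is an alternative, not the notebook's original code.

```python
import pandas as pd

# Made-up sample values containing the characters the cleaning step removes.
toy = pd.Series(["[1,2,3]", "4&5", "6-8"])

# The character class matches '[', ']' or '&'. The brackets are escaped
# because they are special inside a regular expression.
cleaned = toy.str.replace(r"[\[\]&]", "", regex=True)
print(cleaned.tolist())
```

Passing `regex=True` explicitly keeps the behaviour the same across pandas versions, since the default value of that parameter has changed over time.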

In [134]:
#Count the NaN values in each column and print the counts against the column names
count_nan_in_df = ground_truth.isnull().sum()
print(count_nan_in_df)

PLOT_ID        0
LON            0
LAT            0
FLAGGED        0
ANALYSES       0
ENDDATE        1
STARTDATE      1
WATER          0
NO_WATER       0
BAD_IMAGE      0
NOT_SURE       0
CLASS          0
COMMENT      699
dtype: int64


In [135]:
#Replace month abbreviations with their numbers, so that an entry such as '2-Dec' becomes '2-12'
months = {'Jan':'1','Feb':'2','Mar':'3','Apr':'4','May':'5','Jun':'6','Jul':'7','Aug':'8','Sep':'9','Oct':'10','Nov':'11','Dec':'12'}
#Apply the same mapping to each of the month columns
replacements = {col: months for col in ['WATER','NO_WATER','BAD_IMAGE']}

ground_truth.replace(replacements, regex=True, inplace=True)
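
To see what this replacement does to individual values, the sketch below applies a month mapping to a few entries like the ones in the `BAD_IMAGE` column shown earlier. The toy series and the `MONTH_ABBREVIATIONS` list are assumptions made for this example.

```python
import pandas as pd

MONTH_ABBREVIATIONS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                       "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Map each abbreviation to its month number as a string: 'Jan' -> '1', ...
month_numbers = {name: str(number)
                 for number, name in enumerate(MONTH_ABBREVIATIONS, start=1)}

# Toy values: two ranges whose end month is spelled as text, one plain month.
toy = pd.Series(["2-Dec", "4-Nov", "7"])

# regex=True replaces the abbreviation wherever it occurs inside a value,
# instead of requiring the whole value to equal the key.
cleaned = toy.replace(month_numbers, regex=True)
print(cleaned.tolist())
```

After this step every range is written purely with numbers, which is what the month-splitting function in the next cells expects.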

In [136]:
#ground_truth['S2DATE'] = ground_truth['S2DATE'].str.replace('2019-2019','2018-2018')

In [137]:
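def split_str(row, newtable):
    #Expand one plot record into one row per labelled month and append the rows to newtable
    #WATERFLAG codes: '0' = no water, '1' = water, '2' = bad image, '3' = not sure
    #check each row for no-water months
    monthstr=row['NO_WATER']
    if monthstr!='0' and monthstr!='nan':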
        monthlist=[[int(i) for i in s.split('-')] for s in monthstr.split(',')]
        for l in monthlist:
            if len(l)==1: l=[l[0],l[0]]
            for i in range(l[0], l[1]+1):
                newrow=row[['PLOT_ID','LON','LAT','FLAGGED','ANALYSES','STARTDATE','ENDDATE','WATER','NO_WATER','BAD_IMAGE','NOT_SURE','CLASS','COMMENT']]
                newrow['MONTH']=f'{i:02d}'
                newrow['WATERFLAG']='0'
                newrow["SENTINEL2YEAR"]='2018'
                newtable=newtable.append(newrow)
#check each row for water info 
    monthstr=row['WATER']
    if monthstr!='0' and monthstr!='nan':
        monthlist=[[int(i) for i in s.split('-')] for s in monthstr.split(',')]
        for l in monthlist:
            if len(l)==1: l=[l[0],l[0]]
            for i in range(l[0], l[1]+1):
                newrow=row[['PLOT_ID','LON','LAT','FLAGGED','ANALYSES','STARTDATE','ENDDATE','WATER','NO_WATER','BAD_IMAGE','NOT_SURE','CLASS','COMMENT']]
                newrow['MONTH']=f'{i:02d}'
                newrow['WATERFLAG']='1'
                newrow["SENTINEL2YEAR"]='2018'
                newtable=newtable.append(newrow)  # update index / ignore original index
#check each row for bad image 
    monthstr=row['BAD_IMAGE']
    if monthstr!='0' and monthstr!='nan':
        monthlist=[[int(i) for i in s.split('-')] for s in monthstr.split(',')]
        for l in monthlist:
            if len(l)==1: l=[l[0],l[0]]
            for i in range(l[0], l[1]+1):
                newrow=row[['PLOT_ID','LON','LAT','FLAGGED','ANALYSES','STARTDATE','ENDDATE','WATER','NO_WATER','BAD_IMAGE','NOT_SURE','CLASS','COMMENT']]
                newrow['MONTH']=f'{i:02d}'
                newrow['WATERFLAG']='2'
                newrow["SENTINEL2YEAR"]='2018'
                newtable=newtable.append(newrow) 
    #check each row for months the interpreter was not sure about
    monthstr=row['NOT_SURE']
    if monthstr!='0' and monthstr!='nan':
        monthlist=[[int(i) for i in s.split('-')] for s in monthstr.split(',')]
        for l in monthlist:
            if len(l)==1: l=[l[0],l[0]]
            for i in range(l[0], l[1]+1):
                newrow=row[['PLOT_ID','LON','LAT','FLAGGED','ANALYSES','STARTDATE','ENDDATE','WATER','NO_WATER','BAD_IMAGE','NOT_SURE','CLASS','COMMENT']]
                newrow['MONTH']=f'{i:02d}'
                newrow['WATERFLAG']='3'
                newrow["SENTINEL2YEAR"]='2018'
                newtable=newtable.append(newrow) 
                
    return newtable
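
The core of `split_str` is the step that turns a month string into a list of individual months. The sketch below isolates that step in a small helper so it can be checked on its own. The name `expand_months` is hypothetical, and the helper assumes the same input conventions as the function above: comma-separated entries, `-` for ranges, and `'0'` or `'nan'` meaning "no months".

```python
def expand_months(monthstr):
    """Turn a string such as "1,3-5,12" into [1, 3, 4, 5, 12]."""
    # '0' and 'nan' are the placeholders for "no months recorded".
    if monthstr in ("0", "nan"):
        return []
    months = []
    for part in monthstr.split(","):
        # "3-5" gives [3, 5]; a single month such as "12" gives [12].
        bounds = [int(piece) for piece in part.split("-")]
        # range() excludes its end point, hence the + 1.
        months.extend(range(bounds[0], bounds[-1] + 1))
    return months


print(expand_months("1,3-5,12"))
print(expand_months("2-12"))
```

Each month in the returned list corresponds to one output row in the long table, with `MONTH` written as a zero-padded string such as `'03'`.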

In [138]:
# count_nan_in_df = ground_truth.isnull().sum()
# print (count_nan_in_df)

In [139]:
#ground_truth.dtypes

In [140]:
#Making an empty dataframe
result = pd.DataFrame()

In [141]:
#Expand every plot record into its monthly rows
for irow in range(len(ground_truth)):
    result=split_str(ground_truth.iloc[irow], result)
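
`DataFrame.append`, which `split_str` relies on, was deprecated in pandas 1.4 and removed in pandas 2.0, and appending row by row copies the growing table each time. A possible alternative, sketched below on a toy table, is to collect the new rows in a list and build the DataFrame once. The toy data, `FLAG_BY_COLUMN` and `expand_months` are assumptions made for this example and cover only the water and no-water columns.

```python
import pandas as pd

# Toy stand-in for ground_truth; the values are made up for illustration.
toy = pd.DataFrame({
    "PLOT_ID": [1, 2],
    "WATER": ["1-2", "0"],
    "NO_WATER": ["3", "1,2"],
})

# Same flag codes as split_str: '0' = no water, '1' = water.
FLAG_BY_COLUMN = {"NO_WATER": "0", "WATER": "1"}


def expand_months(monthstr):
    """Turn a string such as "1,3-5" into [1, 3, 4, 5]."""
    if monthstr in ("0", "nan"):
        return []
    months = []
    for part in monthstr.split(","):
        bounds = [int(piece) for piece in part.split("-")]
        months.extend(range(bounds[0], bounds[-1] + 1))
    return months


rows = []
for record in toy.to_dict("records"):
    for column, flag in FLAG_BY_COLUMN.items():
        for month in expand_months(record[column]):
            # Copy the original record and add the two derived fields.
            rows.append({**record, "MONTH": f"{month:02d}", "WATERFLAG": flag})

# Build the long table in one step; the index is a fresh 0..n-1 range.
long_table = pd.DataFrame(rows)
print(long_table[["PLOT_ID", "MONTH", "WATERFLAG"]])
```

Unlike the appended table, this one has a unique index, which matters for any later step that selects or drops rows by index label.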

In [142]:
result.shape
#result.loc[13] #shows all the rows derived from the record at index 13

(8724, 16)

In [143]:
result

| | ANALYSES | BAD_IMAGE | CLASS | COMMENT | ENDDATE | FLAGGED | LAT | LON | MONTH | NOT_SURE | NO_WATER | PLOT_ID | SENTINEL2YEAR | STARTDATE | WATER | WATERFLAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 47912 | Open water - freshwater | | 5/12/2018 | 0.0 | 7.802986 | 17.782114 | 08 | 0 | 810 | 137711631.0 | 2018 | 1/12/2018 | 1235612 | 0 |
| 0 | 1.0 | 47912 | Open water - freshwater | | 5/12/2018 | 0.0 | 7.802986 | 17.782114 | 10 | 0 | 810 | 137711631.0 | 2018 | 1/12/2018 | 1235612 | 0 |
| 0 | 1.0 | 47912 | Open water - freshwater | | 5/12/2018 | 0.0 | 7.802986 | 17.782114 | 01 | 0 | 810 | 137711631.0 | 2018 | 1/12/2018 | 1235612 | 1 |
| 0 | 1.0 | 47912 | Open water - freshwater | | 5/12/2018 | 0.0 | 7.802986 | 17.782114 | 02 | 0 | 810 | 137711631.0 | 2018 | 1/12/2018 | 1235612 | 1 |
| 0 | 1.0 | 47912 | Open water - freshwater | | 5/12/2018 | 0.0 | 7.802986 | 17.782114 | 03 | 0 | 810 | 137711631.0 | 2018 | 1/12/2018 | 1235612 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 724 | 1.0 | 24567910 | Open water - marine | | 5/02/2018 | 0.0 | 4.329523 | 6.246484 | 05 | 0 | 0 | 137712355.0 | 2018 | 1/02/2018 | 1381112 | 2 |
| 724 | 1.0 | 24567910 | Open water - marine | | 5/02/2018 | 0.0 | 4.329523 | 6.246484 | 06 | 0 | 0 | 137712355.0 | 2018 | 1/02/2018 | 1381112 | 2 |
| 724 | 1.0 | 24567910 | Open water - marine | | 5/02/2018 | 0.0 | 4.329523 | 6.246484 | 07 | 0 | 0 | 137712355.0 | 2018 | 1/02/2018 | 1381112 | 2 |
| 724 | 1.0 | 24567910 | Open water - marine | | 5/02/2018 | 0.0 | 4.329523 | 6.246484 | 09 | 0 | 0 | 137712355.0 | 2018 | 1/02/2018 | 1381112 | 2 |


In [22]:
# indexNames = result[result.duplicated(['LAT', 'LON','MONTH'], keep=False)]
# indexNames.shape

In [144]:
result = result[['PLOT_ID', 'LON', 'LAT','FLAGGED','ANALYSES','SENTINEL2YEAR', 'STARTDATE','ENDDATE','WATER','NO_WATER','BAD_IMAGE','NOT_SURE','CLASS', 'COMMENT', 'MONTH','WATERFLAG']]

In [145]:
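#Find records whose location and month appear more than once, that are labelled water or no water (WATERFLAG '0' or '1'), and whose plot also has "not sure" months
#Note: the index labels repeat for every row of a plot, so drop() removes all rows that share a flagged label
indexNames = result[result.duplicated(['LAT', 'LON','MONTH'], keep=False) & (result['WATERFLAG'] <= '1') & (result['NOT_SURE']!='0')].index
result.drop(indexNames , inplace=True)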
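
Because `result` keeps the original row label for every month of a plot, dropping by index label and filtering with a boolean mask can give different results. The toy example below illustrates the difference; the data are made up, and which behaviour is wanted for the validation table is a decision for the analyst rather than something this sketch can settle.

```python
import pandas as pd

# Toy table in which all rows share the index label 0, as happens for the
# monthly rows of a single plot in `result`.
toy = pd.DataFrame(
    {"MONTH": ["01", "01", "02"], "WATERFLAG": ["0", "1", "1"]},
    index=[0, 0, 0],
)

# Rows whose month appears more than once.
mask = toy.duplicated(["MONTH"], keep=False)

# Dropping by label removes every row that carries a flagged label,
# including the unflagged month '02'.
dropped_by_label = toy.drop(toy[mask].index)

# Filtering by the boolean mask removes only the flagged rows.
dropped_by_mask = toy[~mask]

print(len(dropped_by_label), len(dropped_by_mask))
```

Calling `reset_index(drop=True)` before the drop is another way to make label-based dropping act on individual rows.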

In [146]:
result

| | PLOT_ID | LON | LAT | FLAGGED | ANALYSES | SENTINEL2YEAR | STARTDATE | ENDDATE | WATER | NO_WATER | BAD_IMAGE | NOT_SURE | CLASS | COMMENT | MONTH | WATERFLAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 137711631.0 | 17.782114 | 7.802986 | 0.0 | 1.0 | 2018 | 1/12/2018 | 5/12/2018 | 1235612 | 810 | 47912 | 0 | Open water - freshwater | | 08 | 0 |
| 0 | 137711631.0 | 17.782114 | 7.802986 | 0.0 | 1.0 | 2018 | 1/12/2018 | 5/12/2018 | 1235612 | 810 | 47912 | 0 | Open water - freshwater | | 10 | 0 |
| 0 | 137711631.0 | 17.782114 | 7.802986 | 0.0 | 1.0 | 2018 | 1/12/2018 | 5/12/2018 | 1235612 | 810 | 47912 | 0 | Open water - freshwater | | 01 | 1 |
| 0 | 137711631.0 | 17.782114 | 7.802986 | 0.0 | 1.0 | 2018 | 1/12/2018 | 5/12/2018 | 1235612 | 810 | 47912 | 0 | Open water - freshwater | | 02 | 1 |
| 0 | 137711631.0 | 17.782114 | 7.802986 | 0.0 | 1.0 | 2018 | 1/12/2018 | 5/12/2018 | 1235612 | 810 | 47912 | 0 | Open water - freshwater | | 03 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 724 | 137712355.0 | 6.246484 | 4.329523 | 0.0 | 1.0 | 2018 | 1/02/2018 | 5/02/2018 | 1381112 | 0 | 24567910 | 0 | Open water - marine | | 05 | 2 |
| 724 | 137712355.0 | 6.246484 | 4.329523 | 0.0 | 1.0 | 2018 | 1/02/2018 | 5/02/2018 | 1381112 | 0 | 24567910 | 0 | Open water - marine | | 06 | 2 |
| 724 | 137712355.0 | 6.246484 | 4.329523 | 0.0 | 1.0 | 2018 | 1/02/2018 | 5/02/2018 | 1381112 | 0 | 24567910 | 0 | Open water - marine | | 07 | 2 |
| 724 | 137712355.0 | 6.246484 | 4.329523 | 0.0 | 1.0 | 2018 | 1/02/2018 | 5/02/2018 | 1381112 | 0 | 24567910 | 0 | Open water - marine | | 09 | 2 |


In [147]:
#Group by PLOT_ID and count the number of monthly records (rows in the MONTH column) for each plot
count = result.groupby(['PLOT_ID'])['MONTH'].count()
count.to_csv('../Supplementary_data/Validation/Refined/CEO_1_AGRYHMET_count.csv')
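
The per-plot counts can also be used directly to list plots that do not have the expected 12 monthly records. The sketch below shows this on a toy table; the data are made up and the threshold of 12 follows the note in the "Analysis parameters" cell.

```python
import pandas as pd

# Toy table: plot 1 has all 12 months, plot 2 has only 3.
toy = pd.DataFrame({
    "PLOT_ID": [1] * 12 + [2] * 3,
    "MONTH": [f"{m:02d}" for m in range(1, 13)] + ["01", "02", "03"],
})

# Number of monthly rows per plot, as in the cell above.
counts = toy.groupby("PLOT_ID")["MONTH"].count()

# Plots whose number of monthly records differs from 12.
incomplete = counts[counts != 12]
print(incomplete)
```

Plots listed in `incomplete` either have months with no label at all or have months labelled more than once, so they are worth inspecting before the accuracy assessment.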

In [148]:
#save the dataframe as csv file 
result.to_csv('../Supplementary_data/Validation/Refined/AGRYHMET/CEO_1_AGRYHMET_2020-10-28.csv')

***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

**Last modified:** January 2020

**Compatible datacube version:** 

## Tags
Browse all available tags on the DE Africa User Guide's [Tags Index](https://) (placeholder as this does not exist yet)