# Water Indices

This notebook examines how well water indices such as NDWI can detect flooding, specifically in Dakar, Senegal. The default resolution is approximately 30 m (`(-0.00027, 0.00027)` in degrees), but it can be changed. Sentinel-2 has the highest resolution of the datasets used here: 10 m for the red, green, blue, and NIR bands and 20 m for the other bands used.
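As a rough check on the 30 m figure, the degree spacing can be converted to metres. This is only a sketch: it assumes a spherical Earth with about 111,320 m per degree of latitude (an approximation, not a value taken from this notebook) and a latitude of 14.78°N, roughly that of Dakar. The helper name `degrees_to_meters` is hypothetical.

```python
import math

# Assumption: approximate length of one degree of latitude on a spherical Earth.
METERS_PER_DEGREE = 111_320.0

def degrees_to_meters(deg, latitude_deg):
    """Return (north-south, east-west) ground distance in metres for a spacing in degrees."""
    north_south = deg * METERS_PER_DEGREE
    # Meridians converge towards the poles, so east-west spacing shrinks with cos(latitude).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

ns_m, ew_m = degrees_to_meters(0.00027, latitude_deg=14.78)
print(f"north-south: {ns_m:.1f} m, east-west: {ew_m:.1f} m")
```

Under these assumptions the pixel is close to 30 m north-south and slightly narrower east-west.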

Images are exported to an `images` directory within this directory.

For each of the 9 flood dates, two images are exported (18 images in total): one showing the RGB composite and the water indices for that date, and one showing the RGB composite and the differences between the water indices for that date and the medians of the corresponding water indices from 2009-2019.

There is also an image called `haz_map_water_median_minus_median_fig.png`, which shows the hazard map and the differences between the medians of the water indices for the flood dates and the medians of the corresponding water indices from 2009-2019.

The water indices used are WOfS (Water Observations from Space), NDWI, MNDWI, AWEI_ns, and AWEI_sh. They are described below:
* WOfS: Typically detects large and reasonably deep bodies of water well. Used primarily for comparison.
* NDWI: Particularly useful for detecting plant water content. Used primarily for comparison.
* MNDWI: A modification of NDWI computed from the green and SWIR bands.
* AWEI_ns: Intended to be more robust to environmental noise than other water indices like NDWI.
* AWEI_sh: Like AWEI_ns, but better corrects for shadows.
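For reference, the band-ratio indices listed above can be sketched with their commonly published formulas. Several things here are assumptions rather than facts about this notebook: the green/NIR form of NDWI is used, the AWEI coefficients follow the usual published definition, inputs are taken to be surface reflectance scaled to 0-1, and the function names are illustrative. The `calculate_indices` helper used later may differ in scaling or in the NDWI variant it implements.

```python
def ndwi(green, nir):
    """Normalized difference of green and NIR (assumed NDWI variant)."""
    return (green - nir) / (green + nir)

def mndwi(green, swir1):
    """Modified NDWI: SWIR takes the place of NIR."""
    return (green - swir1) / (green + swir1)

def awei_ns(green, nir, swir1, swir2):
    """Automated Water Extraction Index, no-shadow variant."""
    return 4 * (green - swir1) - (0.25 * nir + 2.75 * swir2)

def awei_sh(blue, green, nir, swir1, swir2):
    """Automated Water Extraction Index, shadow variant."""
    return blue + 2.5 * green - 1.5 * (nir + swir1) - 0.25 * swir2

# Made-up reflectances for a single water-like pixel.
blue, green, nir, swir1, swir2 = 0.08, 0.10, 0.05, 0.02, 0.01

index_values = {
    'NDWI': ndwi(green, nir),
    'MNDWI': mndwi(green, swir1),
    'AWEI_ns': awei_ns(green, nir, swir1, swir2),
    'AWEI_sh': awei_sh(blue, green, nir, swir1, swir2),
}
print(index_values)
```

The functions use only arithmetic operators, so they should also work elementwise on NumPy or xarray arrays of matching shape.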

# Index

* Import dependencies, set up Dask client, and connect to the data cube
* Load flood hazard data from World Bank
* Show area to load data for
* Load geospatial data
    * Landsat
        * WOfS
        * Landsat 5
        * Landsat 8
    * Sentinel-2
* Merge data
* Mask out the ocean and lakes and obtain the flood hazard map as an xarray
* Show RGB and water indices for flood dates
* Show RGB and the difference of water indices for flood dates from their medians for the selected years
* Show medians of flood dates water indices minus medians for 2009-2019

## Import dependencies, set up Dask client, and connect to the data cube

In [2]:
from collections import ChainMap

import matplotlib.pyplot as plt
import geopandas as gpd
import xarray as xr
import pandas as pd
import numpy as np
import joblib
import os

import sys
sys.path.append('..')
from utils.ceos_utils.dc_display_map import display_map
from utils.deafrica_utils.deafrica_bandindices import \
    calculate_indices
from utils.deafrica_utils.deafrica_datahandling import load_ard

import datacube
dc = datacube.Datacube()

In [3]:
from utils.ceos_utils.dask import create_local_dask_cluster
client = create_local_dask_cluster()

## Load flood hazard data from World Bank

In [4]:
dakar_flood_hazard = gpd.read_file('../floodareas/eo4sd_dakar_fhazard_2018/EO4SD_DAKAR_FHAZARD_2018.shp')

**Remove records with no geometry data**

In [5]:
# Keep only the rows whose geometry is present.
dakar_flood_hazard = dakar_flood_hazard[dakar_flood_hazard.geometry.notna()]

**Change the CRS to EPSG:4326**

In [6]:
dakar_flood_hazard = dakar_flood_hazard.to_crs("EPSG:4326")

**Get the bounding box of the data**

In [7]:
dakar_bounds = dakar_flood_hazard.bounds
min_lon = dakar_bounds.minx.min()
max_lon = dakar_bounds.maxx.max()
min_lat = dakar_bounds.miny.min()
max_lat = dakar_bounds.maxy.max()
lat = (min_lat, max_lat)
lon = (min_lon, max_lon)

## Show area to load data for

In [8]:
## Dakar, Senegal
# Small test
# lat = (14.8270, 14.8422)
# lon = (-17.2576, -17.2172)
# Citizen Science Study Area
lat = (14.7711, 14.7993)
lon = (-17.3706, -17.3366)
# Tip
# lat = (14.6433, 14.7892)
# lon = (-17.5408, -17.4158)
# Full
# lat = (14.6285, 14.8725)
# lon = (-17.5348, -17.2068)

## Coast of Senegal
# North
# lat = (14.3559, 16.0974)
# lon = (-17.5683, -16.4543)
# Full
# lat = (12.3016, 16.1810)
# lon = (-17.8198, -16.3257)

In [9]:
display_map(lat, lon)

## Load geospatial data

**Specify time range and common load parameters**

In [10]:
years = range(2009, 2020) # (inclusive, exclusive)
time_range = [f"{years[0]}-01-01", f"{years[-1]}-12-31"]
# time_ranges = [(f"{year}-01-01", f"{year}-12-31") for year in years]

### Flood Times ###

## EO4SD Hazard Map Flood Times ##

# The actual flood dates are listed below. The ranges defined after them
# are widened to capture more data where some may be missing:
# Landsat 5: [2009-10-22, 2010-10-25, 2011-10-12]
# Landsat 8: [2013-10-01, 2014-11-21, 2015-11-08]
# Sentinel-2: [2016-10-30, 2017-10-10, 2018-10-15]
eo4sd_hazard_map_times = np.array([
    "2009-10-22", "2010-10-25", "2011-10-12",
    "2013-10-01", "2014-11-21", "2015-11-08",
    "2016-10-30", "2017-10-10", "2018-10-15"
])

eo4sd_hazard_map_time_ranges = \
[("2009-10-21", "2009-10-24"), ("2010-10-24", "2010-10-27"),
 ("2011-10-11", "2011-10-14"), ("2013-09-30", "2013-10-03"),
 ("2014-11-20", "2014-11-23"), ("2015-11-07", "2015-11-10"),
 ("2016-10-29", "2016-11-01"), ("2017-10-09", "2017-10-13"),
 ("2018-10-14", "2018-10-17")]

## End EO4SD Hazard Map Flood Times ##

## Citizen Science Data Times ##

cit_sci_times = \
[]
# [("2009-03-01", "2009-03-30"), ("2009-10-11", "2009-10-18"),
#  ("2012-03-01", "2012-03-15"), ("2012-08-01", "2012-09-30")]

## End Citizen Science Data Times ##

time_ranges_floods = sorted(eo4sd_hazard_map_time_ranges + cit_sci_times)

### End Flood Times ###

common_load_params = \
    dict(output_crs="EPSG:4326",
         resolution=(-0.00027,0.00027),
         latitude=lat, longitude=lon,
         group_by='solar_day',
         dask_chunks={'time':40, 
                      'latitude':2000, 
                      'longitude':2000})
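Most of the hand-written ranges above run from one day before to two days after the nominal flood date. A minimal sketch of generating such windows follows; the helper name `flood_window` and its default offsets are assumptions, and at least one range in the list above (2017) is a day longer, so this does not reproduce the list exactly.

```python
from datetime import date, timedelta

def flood_window(day, days_before=1, days_after=2):
    """Return an (start, end) pair of ISO date strings around a flood date."""
    d = date.fromisoformat(day)
    start = d - timedelta(days=days_before)
    end = d + timedelta(days=days_after)
    return (start.isoformat(), end.isoformat())

# Two of the nominal flood dates from the list above.
windows = [flood_window(d) for d in ["2009-10-22", "2016-10-30"]]
print(windows)
```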

### Landsat

>### WOfS

In [11]:
from utils.ceos_utils.dc_load import is_dataset_empty

ls_water_wofs_data = dc.load(product='ga_ls8c_wofs_2', 
               measurements=['water'], 
               time=time_range,
               **common_load_params).persist()
assert not is_dataset_empty(ls_water_wofs_data), f"There is no WOfS data for time range {time_range}"

# Formatting water data #
# Bit 7 (value 128) indicates water; bit 2 (value 4) indicates sea.
ls_water_cls = (ls_water_wofs_data.water&0b10000100)!=0
# Set no_data (missing) values to NaN.
ls_water_cls = \
    ls_water_cls.where(ls_water_wofs_data.water!=1)
del ls_water_wofs_data # Save memory
ls_water_wofs = ls_water_cls.rename('WOfS')
# End formatting water data #

ls_water_wofs_median = ls_water_wofs.median('time').persist()
ls_water_wofs_floods = xr.concat([ls_water_wofs.sel(time=slice(*time_range_flood)) for time_range_flood in time_ranges_floods], dim='time').persist()
ls_water_wofs_floods_minus_median = (ls_water_wofs_floods - ls_water_wofs_median).persist()
ls_water_wofs_floods_median_minus_median = (ls_water_wofs_floods.median('time') - \
                                            ls_water_wofs_median).persist()
del ls_water_wofs_median # Save memory

>### Landsat 5

In [32]:
ls5_data = load_ard(dc=dc, products=['ls5_usgs_sr_scene'], 
                       measurements=['blue', 'green', 'red', 'nir', 'swir1', 'swir2'], 
                       time=time_range,
                       **common_load_params)
ls5_data_water_inds = ls5_data
ls5_water_median = []
water_inds = ['NDWI', 'AWEI_sh', 'AWEI_ns', 'MNDWI']
for water_ind in water_inds:
    ls5_water_ind_data = calculate_indices(ls5_data, index=water_ind, collection='c1')[water_ind]
    ls5_data_water_inds[water_ind] = ls5_water_ind_data
    ls5_water_median.append(ls5_water_ind_data.median('time'))
ls5_water_median = xr.merge(ls5_water_median)

ls5_floods = xr.concat([ls5_data_water_inds.sel(time=slice(*time_range_flood)) for time_range_flood in time_ranges_floods], dim='time')
ls5_water_floods = ls5_floods[water_inds]
ls5_water_floods_minus_median = (ls5_water_floods - ls5_water_median).persist()
ls5_water_floods_median_minus_median = (ls5_floods[water_inds].median('time') - \
                                        ls5_water_median).persist()

ls5_rgb_floods = ls5_floods[['red', 'green', 'blue']].persist()

Using pixel quality parameters for USGS Collection 1
Finding datasets
    ls5_usgs_sr_scene
Applying pixel quality/cloud mask
Returning 35 time steps as a dask array
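The "flood minus median" quantities computed above are per-pixel anomalies: the flood-date index value minus that pixel's median over the whole time range. A small NumPy sketch with made-up numbers (not data from this notebook) illustrates the arithmetic; `np.nanmedian` is used so that masked observations are skipped, as xarray's `median` does by default for floating-point data.

```python
import numpy as np

# Made-up index values with shape (time, pixel); NaN marks a cloud-masked observation.
series = np.array([
    [-0.30, 0.10],
    [-0.20, np.nan],
    [-0.25, 0.20],
    [ 0.40, 0.60],   # flood-date observation
])

baseline = np.nanmedian(series, axis=0)  # per-pixel median over time, ignoring NaN
flood_obs = series[-1]
anomaly = flood_obs - baseline           # positive values are wetter than usual

print(baseline, anomaly)
```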


>### Landsat 8

In [42]:
ls8_data = load_ard(dc=dc, products=['ls8_usgs_sr_scene'], 
                       measurements=['blue', 'green', 'red', 'nir', 'swir1', 'swir2'], 
                       time=time_range,
                       **common_load_params).persist()
ls8_data_water_inds = ls8_data
ls8_water_median = []
water_inds = ['NDWI', 'AWEI_sh', 'AWEI_ns', 'MNDWI']
for water_ind in water_inds:
    ls8_water_ind_data = calculate_indices(ls8_data, index=water_ind, collection='c1')[water_ind]
    ls8_data_water_inds[water_ind] = ls8_water_ind_data
    ls8_water_median.append(ls8_water_ind_data.median('time'))
ls8_water_median = xr.merge(ls8_water_median)

ls8_floods = xr.concat([ls8_data_water_inds.sel(time=slice(*time_range_flood)) for time_range_flood in time_ranges_floods], dim='time')
ls8_water_floods = ls8_floods[water_inds]
ls8_water_floods_minus_median = (ls8_water_floods - ls8_water_median).persist()
ls8_water_floods_median_minus_median = (ls8_floods[water_inds].median('time') - \
                                        ls8_water_median).persist()

ls8_rgb_floods = ls8_floods[['red', 'green', 'blue']].persist()

Using pixel quality parameters for USGS Collection 1
Finding datasets
    ls8_usgs_sr_scene
Applying pixel quality/cloud mask
Returning 265 time steps as a dask array


### Sentinel-2

In [43]:
s2_data = load_ard(dc=dc, products=['s2_l2a'], 
                       measurements=[
                           # Used by MNDWI, AWEI_ns, AWEI_sh
                           'green', 'swir_1', 
                           # Used by AWEI_ns, AWEI_sh
                           'nir', 'swir_2',
                           # Used by AWEI_sh
                           'blue',
                           # Used by NDWI, TCW, WI2015
                           'red'],
                       time=time_range,
                       **common_load_params).persist()
s2_data_water_inds = s2_data
s2_water_median = []
water_inds = ['NDWI', 'AWEI_sh', 'AWEI_ns', 'MNDWI']
for water_ind in water_inds:
    s2_water_ind_data = calculate_indices(s2_data, index=water_ind, collection='s2')[water_ind]
    s2_data_water_inds[water_ind] = s2_water_ind_data
    s2_water_median.append(s2_water_ind_data.median('time'))
s2_water_median = xr.merge(s2_water_median)

s2_floods = xr.concat([s2_data_water_inds.sel(time=slice(*time_range_flood)) for time_range_flood in time_ranges_floods], dim='time')
s2_water_floods = s2_floods[water_inds]
s2_water_floods_minus_median = (s2_water_floods - s2_water_median).persist()
s2_water_floods_median_minus_median = (s2_floods[water_inds].median('time') - \
                                       s2_water_median).persist()

s2_rgb_floods = s2_floods[['red', 'green', 'blue']].persist()

Using pixel quality parameters for Sentinel 2
Finding datasets
    s2_l2a
Applying pixel quality/cloud mask
Returning 199 time steps as a dask array


## Merge data

In [44]:
water_floods_minus_median = xr.merge((ls_water_wofs_floods_minus_median, ls5_water_floods_minus_median,
                                      ls8_water_floods_minus_median, s2_water_floods_minus_median))
water_floods_minus_median = xr.concat([water_floods_minus_median.sel(time=slice(*time_range_flood)).mean('time') 
                                       for time_range_flood in time_ranges_floods], dim='time').compute().reindex({'time': eo4sd_hazard_map_times})
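The cell above averages all observations that fall inside each flood window, yielding one value per window. The same idea is sketched below in NumPy with made-up data; label-based time slices in xarray include both endpoints, so the comparison here is inclusive as well. The function name `window_means` is hypothetical.

```python
import numpy as np

times = np.array(["2009-10-20", "2009-10-22", "2009-10-23", "2010-10-25"],
                 dtype="datetime64[D]")
values = np.array([0.1, 0.5, np.nan, 0.3])  # made-up index values for one pixel

windows = [("2009-10-21", "2009-10-24"),
           ("2010-10-24", "2010-10-27"),
           ("2011-10-11", "2011-10-14")]   # no observations fall in this window

def window_means(times, values, windows):
    """Mean of the non-NaN values inside each inclusive (start, end) window."""
    means = []
    for start, end in windows:
        in_window = (times >= np.datetime64(start)) & (times <= np.datetime64(end))
        selected = values[in_window]
        valid = selected[~np.isnan(selected)]
        # A window with no valid observations yields NaN.
        means.append(valid.mean() if valid.size else np.nan)
    return np.array(means)

result = window_means(times, values, windows)
print(result)
```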

In [45]:
# nrows = len(water_floods_minus_median.time)
# ncols = len(water_floods_minus_median.data_vars)
# fig, ax = plt.subplots(nrows, ncols, figsize=(6*ncols, 4*nrows))

# for time_ind, time in enumerate(water_floods_minus_median.time):
#     for data_var_ind, data_var in enumerate(water_floods_minus_median.data_vars):
#         water_floods_minus_median[data_var].isel(time=time_ind).plot.imshow(ax=ax[time_ind, data_var_ind])
# plt.tight_layout()
# plt.show()

**Median of water indices for flood dates minus medians for 2009 - 2019**

In [46]:
if 'time' not in ls_water_wofs_floods_median_minus_median.dims:
    ls_water_wofs_floods_median_minus_median = \
        ls_water_wofs_floods_median_minus_median.expand_dims({'time':[0]})
if 'time' not in ls5_water_floods_median_minus_median.dims:
    ls5_water_floods_median_minus_median = \
        ls5_water_floods_median_minus_median.expand_dims({'time':[1]})
if 'time' not in ls8_water_floods_median_minus_median.dims:
    ls8_water_floods_median_minus_median = \
        ls8_water_floods_median_minus_median.expand_dims({'time':[2]})
if 'time' not in s2_water_floods_median_minus_median.dims:
    s2_water_floods_median_minus_median = \
        s2_water_floods_median_minus_median.expand_dims({'time':[3]})

In [47]:
water_floods_median_minus_median = \
    xr.merge((ls_water_wofs_floods_median_minus_median, ls5_water_floods_median_minus_median,
              ls8_water_floods_median_minus_median, s2_water_floods_median_minus_median)).mean('time').compute()

**Water for flood dates**

In [48]:
water_floods = xr.merge((ls_water_wofs_floods, ls5_water_floods,
                         ls8_water_floods, s2_water_floods))
water_floods = xr.concat([water_floods.sel(time=slice(*time_range_flood)).mean('time') 
                          for time_range_flood in time_ranges_floods], dim='time').compute().reindex({'time': eo4sd_hazard_map_times})

In [49]:
# nrows = len(water_floods.time)
# ncols = len(water_floods.data_vars)
# fig, ax = plt.subplots(nrows, ncols, figsize=(6*ncols, 4*nrows))

# for time_ind, time in enumerate(water_floods.time):
#     for data_var_ind, data_var in enumerate(water_floods.data_vars):
#         water_floods[data_var].isel(time=time_ind).plot.imshow(ax=ax[time_ind, data_var_ind])
# plt.tight_layout()
# plt.show()

In [50]:
rgb_floods = xr.merge((ls5_rgb_floods, ls8_rgb_floods, s2_rgb_floods))
rgb_floods = xr.concat([rgb_floods.sel(time=slice(*time_range_flood)).mean('time') 
                          for time_range_flood in time_ranges_floods], dim='time').compute().reindex({'time': eo4sd_hazard_map_times})
# Normalize RGB to [0,1].
rgb_min, rgb_max = 0, 4000
rgb_floods = (rgb_floods - rgb_min) / (rgb_max - rgb_min)
rgb_floods = rgb_floods.where(rgb_floods<1, 1)
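The scaling to [0, 1] can also be written with a clip. The sketch below is a variant rather than a copy of the cell above: it also clips negative values to 0 and leaves NaN (missing) pixels as NaN. The function name `normalize_rgb` is hypothetical; the 0 and 4000 limits are the ones used above.

```python
import numpy as np

def normalize_rgb(values, vmin=0, vmax=4000):
    """Scale raw band values linearly to [0, 1] and clip anything outside that range."""
    scaled = (np.asarray(values, dtype=float) - vmin) / (vmax - vmin)
    return np.clip(scaled, 0.0, 1.0)  # NaN inputs remain NaN

pixels = np.array([-100, 0, 2000, 4000, 6000, np.nan])
scaled = normalize_rgb(pixels)
print(scaled)
```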

## Mask out the ocean and lakes and obtain the flood hazard map as an xarray

In [51]:
s2_land_mask = s2_water_floods.MNDWI.mean('time') < 0.1
water_floods_minus_median = water_floods_minus_median.where(s2_land_mask)
water_floods = water_floods.where(s2_land_mask)
rgb_floods = rgb_floods.where(s2_land_mask)
# water_floods_mean_minus_median = water_floods_mean_minus_median.where(s2_land_mask)
water_floods_median_minus_median = water_floods_median_minus_median.where(s2_land_mask)
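The land mask keeps pixels whose mean MNDWI is below 0.1 and blanks out everything else, which removes the ocean and lakes. A NumPy sketch with made-up values shows the effect; the arrays are illustrative only.

```python
import numpy as np

mndwi_mean = np.array([[0.45, 0.05],
                       [-0.20, 0.10]])   # made-up mean MNDWI per pixel
index = np.array([[0.9, 0.3],
                  [0.1, 0.7]])           # made-up water index values

land_mask = mndwi_mean < 0.1                 # True where the pixel is treated as land
masked = np.where(land_mask, index, np.nan)  # non-land pixels become NaN

print(land_mask)
print(masked)
```

The comparison is strict, so a pixel whose mean MNDWI equals the threshold is masked out.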

In [52]:
from utils.deafrica_utils.deafrica_spatialtools import xr_rasterize

flood_hazard_enc = {0:'No Risk', 1:'Low Risk', 2:'Medium Risk', 3:'High Risk'}
# Rasterize the polygons of each risk class onto the RGB grid and keep land pixels only.
flood_hazard_masks = \
{risk_code: xr_rasterize(dakar_flood_hazard[dakar_flood_hazard['RISKCODE_H']==risk_code], 
                         rgb_floods).astype(bool).where(s2_land_mask, False)
 for risk_code in flood_hazard_enc}

Rasterizing to match xarray.DataArray dimensions (106, 127)
Rasterizing to match xarray.DataArray dimensions (106, 127)
Rasterizing to match xarray.DataArray dimensions (106, 127)
Rasterizing to match xarray.DataArray dimensions (106, 127)


In [53]:
flood_hazard_map = None
for val, mask in flood_hazard_masks.items():
    if flood_hazard_map is None:
        flood_hazard_map = xr.full_like(mask, val)
    else:
        flood_hazard_map = flood_hazard_map.where(~mask, val)
flood_hazard_map = flood_hazard_map.where(s2_land_mask)
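The loop above starts from a map filled with the first risk code and then overlays each subsequent mask, so where masks overlap the later (higher) code wins. A NumPy sketch with made-up masks shows the same layering.

```python
import numpy as np

# Made-up boolean masks per risk code for a 1 x 3 grid.
masks = {
    0: np.array([[True, False, False]]),
    1: np.array([[False, True, True]]),
    2: np.array([[False, False, True]]),
}

hazard = np.zeros((1, 3), dtype=int)  # start with the lowest code everywhere
for code in sorted(masks):
    # Overwrite with this code wherever its mask is set.
    hazard = np.where(masks[code], code, hazard)

print(hazard)
```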

## Show RGB and water indices for flood dates

In [54]:
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 400

In [55]:
def create_flood_date_figs(dataset, base_file_name, time_ind, diff_median=False):
    """Save one figure with the RGB composite and each water index for one flood date."""
    NROWS_PER_DATE = 2
    nrows = NROWS_PER_DATE
    # One extra panel is needed for the RGB composite.
    ncols = int(np.ceil((len(dataset.data_vars) + 1) / nrows))
    
    fig, ax = plt.subplots(nrows, ncols, figsize=(7*ncols, 5*nrows), squeeze=False)
    
    time = dataset.time[time_ind]
    time_str = pd.to_datetime(time.values).strftime('%Y-%m-%d')
    current_ax = ax[0, 0]
    rgb_floods.isel(time=time_ind).to_array().plot.imshow(ax=current_ax, vmin=0, vmax=1)
    current_ax.set_title(f'RGB ({time_str})')
    for data_var_ind, data_var in enumerate(dataset.data_vars):
        data = dataset[data_var].isel(time=time_ind)
        # Differences from the median are shown from 0 upward (wetter than usual).
        vmin = 0 if diff_median else data.quantile(0.05).values
        vmax = data.quantile(0.95).values
        # Panel 0 is the RGB image, so index panels start at position 1.
        row, col = divmod(data_var_ind + 1, ncols)
        current_ax = ax[row, col]
        data.plot.imshow(ax=current_ax, cmap='Blues', vmin=vmin, vmax=vmax)
        diff_median_text = ' minus median' if diff_median else ''
        current_ax.set_title(f'{data_var}{diff_median_text} ({time_str})')
    plt.tight_layout()
    os.makedirs('images', exist_ok=True)
    fig.savefig(f'images/{base_file_name}_{time_ind}.png')
    # Close the figure so that empty figures do not accumulate in memory.
    plt.close(fig)
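The panel layout used by the function above (one RGB panel followed by one panel per index, filled row by row) can be sketched without any plotting. The helper name `panel_positions` is hypothetical, and the column count here is derived from the total number of panels including the RGB one.

```python
import math

def panel_positions(n_index_panels, nrows=2):
    """Return the column count and the (row, col) position of every panel."""
    n_panels = n_index_panels + 1            # one extra panel for the RGB image
    ncols = math.ceil(n_panels / nrows)
    # divmod(i, ncols) maps a flat panel number to its row and column.
    return ncols, [divmod(i, ncols) for i in range(n_panels)]

ncols, positions = panel_positions(5)  # five water indices, as in this notebook
print(ncols, positions)
```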

In [56]:
from functools import partial
create_flood_date_figs_rgb_water = \
    partial(create_flood_date_figs, water_floods, 'rgb_water_fig')

for time_ind in range(len(water_floods.time)):
    create_flood_date_figs_rgb_water(time_ind)


## Show RGB and the difference of water indices for flood dates from their medians for the selected years

In [57]:
create_flood_date_figs_rgb_water_minus_median = \
    partial(create_flood_date_figs, water_floods_minus_median, 'rgb_water_minus_median_fig')

for time_ind in range(len(water_floods_minus_median.time)):
    create_flood_date_figs_rgb_water_minus_median(time_ind, diff_median=True)


## Show medians of flood dates water indices minus medians for 2009-2019

In [58]:
def create_flood_date_figs_agg(dataset, base_file_name, agg='mean'):
    """Save one figure with the hazard map and each aggregated water index difference.
    
    `agg` only labels the panel titles; `dataset` must already be aggregated over time.
    """
    assert agg in ['mean', 'median'], "The variable agg must be one of ['mean', 'median']"
    
    NROWS_PER_DATE = 2
    nrows = NROWS_PER_DATE
    # One extra panel is needed for the flood hazard map.
    ncols = int(np.ceil((len(dataset.data_vars) + 1) / nrows))
    
    fig, ax = plt.subplots(nrows, ncols, figsize=(7*ncols, 5*nrows), squeeze=False)
    
    current_ax = ax[0, 0]
    flood_hazard_map.plot.imshow(ax=current_ax)
    current_ax.set_title('Given Flood Hazard Map')
    for data_var_ind, data_var in enumerate(dataset.data_vars):
        vmin = 0
        vmax = dataset[data_var].quantile(0.95).values
        # Panel 0 is the hazard map, so index panels start at position 1.
        row, col = divmod(data_var_ind + 1, ncols)
        current_ax = ax[row, col]
        dataset[data_var].plot.imshow(ax=current_ax, cmap='Blues', vmin=vmin, vmax=vmax)
        current_ax.set_title(f'{agg.capitalize()} of flood dates {data_var} minus median')
    plt.tight_layout()
    os.makedirs('images', exist_ok=True)
    fig.savefig(f'images/{base_file_name}.png')
    # Close the figure so that empty figures do not accumulate in memory.
    plt.close(fig)

In [59]:
create_flood_date_figs_agg(water_floods_median_minus_median, 
                           'haz_map_water_median_minus_median_fig', 
                           agg='median')
