## Canberra lithologies case study

This case study was motivated by learning that the ACT is interested in managed aquifer recharge for watering some green spaces.

This notebook does not look at AEM data, although it sits in a repository whose name suggests otherwise.

## Downloading the data 

Not thoroughly documented.

Data was downloaded from the usual sources, NGIS and Elvis. The NGIS download for the Murrumbidgee catchment did not include the bores inside the ACT, so the ACT bores had to be downloaded separately; this notebook merges the two sets of lithology logs. Spatial locations were merged and subsetted manually in QGIS.

Some of the data output by this present notebook fed into a [lithology log viewer](https://github.com/csiro-hydrogeology/lithology-viewer) that can be run as a dashboard on Binder.


In [None]:
import os
import re  # used explicitly further down for re.compile
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import rasterio
from rasterio.plot import show
import geopandas as gpd


In [None]:
# Only set to True for co-dev of ela from this use case:
ela_from_source = False
ela_from_source = True

In [None]:
if ela_from_source:
    if ('ELA_SRC' in os.environ):
        root_src_dir = os.environ['ELA_SRC']
    elif sys.platform == 'win32':
        root_src_dir = r'C:\src\github_jm\pyela'
    else:
        username = os.environ['USER']
        root_src_dir = os.path.join('/home', username, 'src/ela/pyela')
    pkg_src_dir = root_src_dir
    sys.path.insert(0, pkg_src_dir)

from ela.textproc import *
from ela.utils import *
from ela.classification import *
from ela.visual import *
from ela.spatial import SliceOperation

## Importing data

There are two main sets of information we need: the borehole lithology logs, and the spatial information, namely the surface elevation (DEM) and the geolocation of a subset of bores around Canberra.

In [None]:
data_path = None

You probably want to explicitly set `data_path` to the location where you put the folder(s) e.g:

In [None]:
#data_path = '/home/myusername/data' # On Linux, if you now have the folder /home/myusername/data/Canberra
#data_path = r'C:\data\Lithology'  # On Windows, if you have C:\data\Lithology\Canberra

Otherwise a fallback for the pyela developer(s)

In [None]:
if data_path is None:
    if ('ELA_DATA' in os.environ):
        data_path = os.environ['ELA_DATA']
    elif sys.platform == 'win32':
        data_path = r'C:\data\Lithology'
    else:
        username = os.environ['USER']
        data_path = os.path.join('/home', username, 'data')
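
The resolution order used above (explicit value, then the `ELA_DATA` environment variable, then a platform default) can be wrapped in a small function that is easier to test. This is only a sketch: `resolve_data_path` is a hypothetical helper, not part of `ela`, and it uses `os.path.expanduser('~')` instead of the `USER` variable, so the non-Windows default may differ from `/home/<username>/data` on some systems.

```python
import os
import sys


def resolve_data_path(explicit=None, env=None, platform=None):
    """Return the data folder: explicit value first, then ELA_DATA, then a default."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    if explicit is not None:
        return explicit
    if 'ELA_DATA' in env:
        return env['ELA_DATA']
    if platform == 'win32':
        return r'C:\data\Lithology'
    # expanduser avoids a KeyError when the USER variable is not set
    return os.path.join(os.path.expanduser('~'), 'data')


print(resolve_data_path(env={'ELA_DATA': '/tmp/lithology'}))
```

Passing `env` and `platform` explicitly is only there so that the fallback logic can be exercised without touching the real environment.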

In [None]:
data_path

In [None]:
cbr_datadir = os.path.join(data_path, 'Canberra')
cbr_datadir_out = os.path.join(cbr_datadir, 'out')
ngis_datadir = os.path.join(data_path, 'NGIS')
act_shp_datadir = os.path.join(ngis_datadir, 'shp_ACT')
bidgee_shp_datadir = os.path.join(ngis_datadir, 'shp_murrumbidgee_river')

In [None]:
write_outputs = True

## DEM


In [None]:
dem = rasterio.open(os.path.join(cbr_datadir,'CLIP.tif'))

In [None]:
fig, ax = plt.subplots(figsize=(12, 12))
show(dem,title='Canberra', cmap='terrain',  ax=ax)

## Bore data

In [None]:
bore_locations_raw = gpd.read_file(os.path.join(cbr_datadir, 'Bores/act_bores.shp'))

In [None]:
bore_locations_raw.columns

In [None]:
bore_locations_raw.crs, dem.crs

The DEM raster and the bore location shapefile do not use the same projection (coordinate reference system) so we reproject one of them. We choose the raster's UTM.

In [None]:
bore_locations = bore_locations_raw.to_crs(dem.crs)

For this location we actually had to download two data sets from the NGIS: the data for the Murrumbidgee catchment does not include most of the bores that are inside the ACT.

In [None]:
lithology_logs_act = pd.read_csv(os.path.join(act_shp_datadir, 'NGIS_LithologyLog.csv'))
lithology_logs_bidgee = pd.read_csv(os.path.join(bidgee_shp_datadir, 'NGIS_LithologyLog.csv'))

In [None]:
len(lithology_logs_act), len(lithology_logs_bidgee)

In [None]:
lithology_logs = pd.concat([lithology_logs_act, lithology_logs_bidgee])
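
Since the two downloads cover neighbouring areas, it may be worth checking whether some records appear in both. The sketch below uses made-up rows (not NGIS data) to show one way of concatenating with a fresh index and counting exact duplicates; whether the real extracts overlap at all is an assumption we have not verified.

```python
import pandas as pd

# Made-up rows standing in for the two NGIS extracts
logs_a = pd.DataFrame({'BoreID': [1, 1, 2],
                       'FromDepth': [0.0, 2.0, 0.0],
                       'Description': ['soil', 'clay', 'sand']})
logs_b = pd.DataFrame({'BoreID': [2, 3],
                       'FromDepth': [0.0, 0.0],
                       'Description': ['sand', 'shale']})

# ignore_index=True gives the merged table a unique 0..n-1 index
merged = pd.concat([logs_a, logs_b], ignore_index=True)

# Rows that are identical in every column to an earlier row
n_duplicates = int(merged.duplicated().sum())
deduplicated = merged.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(merged), len(deduplicated))
```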

In [None]:
fig, ax = plt.subplots(figsize=(12, 12))
show(dem,title='Canberra', cmap='terrain',  ax=ax)
bore_locations.plot(ax=ax, facecolor='black')

Let's create a copy of the merged logs, so that we can fall back on the original if we mess things up.

In [None]:
df = lithology_logs.copy()

In [None]:
df.columns

In [None]:
# These are probably the defaults from the ela package imports, but to be explicit:
DEPTH_FROM_COL = 'FromDepth'
DEPTH_TO_COL = 'ToDepth'

TOP_ELEV_COL = 'TopElev'
BOTTOM_ELEV_COL = 'BottomElev'

LITHO_DESC_COL = 'Description'
HYDRO_CODE_COL = 'HydroCode'

HYDRO_ID_COL = 'HydroID'
BORE_ID_COL = 'BoreID'

We suspect that some registered locations actually have no lithology logs recorded. We want to keep only the boreholes that have at least one row in the lithology logs.

TODO: this should be a feature in the package.

In [None]:
df_ids = set(df[BORE_ID_COL].values)
geolog_ids = set(bore_locations[HYDRO_ID_COL].values)

In [None]:
len(df_ids), len(geolog_ids)

In [None]:
keep = df_ids.intersection(geolog_ids)

In [None]:
s = bore_locations[HYDRO_ID_COL]

In [None]:
bore_locations = bore_locations[s.isin(keep)]
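
The filtering steps above amount to a set intersection followed by `isin`. A minimal, self-contained sketch on made-up identifiers (the column names mirror the ones used here, the values are invented) also shows how to list the locations that get dropped:

```python
import pandas as pd

locations = pd.DataFrame({'HydroID': [10, 11, 12, 13],
                          'name': ['a', 'b', 'c', 'd']})
logs = pd.DataFrame({'BoreID': [10, 10, 12, 99]})

# Identifiers present both in the logs and in the locations
with_logs = set(logs['BoreID']) & set(locations['HydroID'])

has_log = locations['HydroID'].isin(with_logs)
kept = locations[has_log]
dropped = locations[~has_log]

print(len(kept), 'bores kept,', len(dropped), 'dropped')
```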

Visually, we do indeed have slightly fewer bores:

In [None]:
fig, ax = plt.subplots(figsize=(12, 12))
show(dem,title='Canberra', cmap='terrain',  ax=ax)
bore_locations.plot(ax=ax, facecolor='black')

### Subset further to a location of interest

Here we tried to reduce the area further, to obtain a case study small enough to submit to a gallery (pyvista). However, we ended up with too few classified bores and missing data everywhere. The subset needs to be chosen so that it still contains enough data, which is tricky.


In [None]:
    
# max/min bounds
shp_bbox = get_bbox(bore_locations)
shp_bbox

In [None]:
raster_bbox = dem.bounds
raster_bbox

In [None]:
x_min,x_max,y_min,y_max  = intersecting_bounds([shp_bbox, raster_bbox])
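
For readers unfamiliar with the operation, intersecting bounding boxes is just a matter of taking the largest of the minima and the smallest of the maxima. The sketch below is not the `ela` implementation: `intersect_bboxes` is a hypothetical function that assumes each box is ordered `(x_min, y_min, x_max, y_max)`, like rasterio's `(left, bottom, right, top)` bounds, and returns `(x_min, x_max, y_min, y_max)` to mirror the unpacking above.

```python
def intersect_bboxes(bboxes):
    """Intersect boxes given as (x_min, y_min, x_max, y_max).

    Returns (x_min, x_max, y_min, y_max) of the common area.
    """
    x_min = max(b[0] for b in bboxes)
    y_min = max(b[1] for b in bboxes)
    x_max = min(b[2] for b in bboxes)
    y_max = min(b[3] for b in bboxes)
    if x_min >= x_max or y_min >= y_max:
        raise ValueError('bounding boxes do not overlap')
    return x_min, x_max, y_min, y_max


# Made-up coordinates, in metres
shp_box = (680000.0, 6070000.0, 700000.0, 6100000.0)
raster_box = (685000.0, 6060000.0, 710000.0, 6095000.0)
common = intersect_bboxes([shp_box, raster_box])
print(common)
```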

In [None]:
trial = cookie_cut_gpd(bore_locations, x_min, x_max, y_min, y_max)

In [None]:
fig, ax = plt.subplots(figsize=(12, 12))
show(dem,title='Canberra', cmap='terrain',  ax=ax)
trial.plot(ax=ax, facecolor='black')

In [None]:
# Tried to use only a further subset but there is not enough data to do the interpolation (too many "none" descriptions)
# Parking this for now.

# bore_locations = trial
# shp_bbox = get_bbox(trial)

# x_min = shp_bbox[0]
# x_max = shp_bbox[2]
# y_min = shp_bbox[1]
# y_max = shp_bbox[3]

### Merging the geolocation from the shapefile and lithology records

The geopandas data frame has a `geometry` column listing `POINT` objects. `ela` includes `get_coords_from_gpd_shape` to extract the coordinates into a simpler structure. `ela` also predefines column names for easting/northing information (e.g. `EASTING_COL`), which we can use to name our coordinates.

In [None]:
bore_locations.columns

In [None]:
def get_geoloc_df(bore_locations, additional_columns):
    geoloc = get_coords_from_gpd_shape(bore_locations, colname='geometry', out_colnames=[EASTING_COL, NORTHING_COL])
    for cn in additional_columns:
        geoloc[cn] = bore_locations[cn].values  # use .values to drop the index; otherwise index alignment can produce NaN
    return geoloc
        
geoloc = get_geoloc_df(bore_locations, ['Latitude', 'Longitude', HYDRO_ID_COL])

In [None]:
geoloc.info()

In [None]:
# to be reused in experimental notebooks:
geoloc_filename = os.path.join(cbr_datadir_out,'geoloc.pkl')
if not os.path.exists(geoloc_filename):
    geoloc.to_pickle(geoloc_filename)

In [None]:
# geoloc.to_pickle(geoloc_filename)
# geoloc.to_csv(os.path.join(cbr_datadir_out,'geoloc.csv'))

In [None]:
geoloc[HYDRO_ID_COL].dtype, df[BORE_ID_COL].dtype

In [None]:
df = pd.merge(df, geoloc, how='inner', left_on=BORE_ID_COL, right_on=HYDRO_ID_COL, sort=False, copy=True, indicator=False, validate=None)
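
Before relying on the inner join, it can help to see how many rows would be lost on either side. A possible check, shown here on made-up data, is an outer merge with `indicator=True`; `validate='many_to_one'` additionally raises an error if a bore identifier is not unique in the location table.

```python
import pandas as pd

logs = pd.DataFrame({'BoreID': [10, 10, 12, 99],
                     'Description': ['soil', 'clay', 'sand', 'shale']})
geoloc = pd.DataFrame({'HydroID': [10, 11, 12],
                       'Easting': [1.0, 2.0, 3.0]})

# Outer join keeps unmatched rows; the _merge column says where each row came from
checked = pd.merge(logs, geoloc, how='outer',
                   left_on='BoreID', right_on='HydroID',
                   indicator=True, validate='many_to_one')
counts = checked['_merge'].value_counts()

print(counts)
```

Rows flagged `left_only` are log entries without a location, and `right_only` rows are locations without any log entry; an inner join silently drops both.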

In [None]:
len(df)

In [None]:
df.head()

### Round up 'depth to' and 'depth from' columns

We round the depth-related columns to whole metres (the call below passes `np.round`, which rounds to the nearest integer rather than always up) and drop the entries where the resulting depths have degenerated to 0. `ela` has a class `DepthsRounding` to facilitate these operations on lithology records with varying column names.

We first clean up height/depths columns to make sure they are numeric.

In [None]:
# TODO: function in the package
def as_numeric(x):
    """Convert a raw cell value to float; None and the string 'None' become NaN."""
    if x is None or x == 'None':
        return np.nan
    # Floats pass through unchanged; numeric strings and integers are converted
    return float(x)

In [None]:
df[DEPTH_FROM_COL] = df[DEPTH_FROM_COL].apply(as_numeric)
df[DEPTH_TO_COL] = df[DEPTH_TO_COL].apply(as_numeric)
df[TOP_ELEV_COL] = df[TOP_ELEV_COL].apply(as_numeric)
df[BOTTOM_ELEV_COL] = df[BOTTOM_ELEV_COL].apply(as_numeric)
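
As an aside, pandas offers a vectorised alternative to applying a function cell by cell. The sketch below, on made-up values, uses `pd.to_numeric` with `errors='coerce'`. Note the difference in behaviour: `as_numeric` raises an error on an unparseable string, whereas `errors='coerce'` quietly turns it into `NaN`, so this is not a drop-in replacement if we want to be alerted to unexpected values.

```python
import pandas as pd

# Made-up column mixing numeric strings, the string 'None', None and a number
raw = pd.Series(['1.5', 'None', None, 3, '7'])

# Anything that cannot be parsed as a number becomes NaN
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric)
```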

In [None]:
dr = DepthsRounding(DEPTH_FROM_COL, DEPTH_TO_COL)

In [None]:
"Before rounding heights we have " + str(len(df)) + " records"

In [None]:
df = dr.round_to_metre_depths(df, np.round, True)
"After removing thin slices of less than a metre, we are left with " + str(len(df)) + " records"
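
To make the effect of the rounding concrete, here is a sketch on made-up intervals. It assumes that the operation rounds both depth columns and then drops the intervals whose thickness has become zero; this is our reading of the step, not a copy of what `DepthsRounding` does internally.

```python
import numpy as np
import pandas as pd

# Made-up depth intervals, in metres below ground
records = pd.DataFrame({'FromDepth': [0.0, 0.4, 1.2, 3.6],
                        'ToDepth': [0.4, 1.2, 3.6, 3.9]})

rounded = records.copy()
rounded['FromDepth'] = np.round(rounded['FromDepth'])
rounded['ToDepth'] = np.round(rounded['ToDepth'])

# Drop intervals that have collapsed to zero thickness after rounding
rounded = rounded[rounded['ToDepth'] - rounded['FromDepth'] > 0]

print(len(records), 'records before,', len(rounded), 'after')
```

In this example the first and last intervals are thinner than a metre and disappear, which is why the record count drops after this step.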

## Exploring the descriptive lithology 

In [None]:
descs = df[LITHO_DESC_COL]
descs = descs.reset_index()
descs = descs[LITHO_DESC_COL]
descs.head()

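The description column is read as generic Python objects and can contain missing values (`NaN` floats or `None`). The numeric columns were already cleaned above with `as_numeric`; here we define `clean_desc` to replace missing descriptions with empty strings.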

In [None]:
def clean_desc(x):
    if isinstance(x, float):
        return u''
    elif x is None:
        return u''
    else:
        # python2 return unicode(x)        
        return x

In [None]:
y = [clean_desc(x) for x in descs]

In [None]:
from striplog import Lexicon
lex = Lexicon.default()

In [None]:
y = clean_lithology_descriptions(y, lex)

We get a flat list of all the "tokens" but remove stop words ('s', 'the' and the like)

In [None]:
y = v_lower(y)
vt = v_word_tokenize(y)
flat = np.concatenate(vt)

In [None]:
import nltk
from nltk.corpus import stopwords

In [None]:
stoplist = stopwords.words('english')
exclude = stoplist + ['.',',',';',':','(',')','-']
flat = [word for word in flat if word not in exclude]

In [None]:
len(set(flat))

In [None]:
df_most_common = token_freq(flat, 50)

In [None]:
plot_freq(df_most_common)

In [None]:
df_most_common
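
For reference, the same kind of frequency table can be produced with the standard library alone. The sketch below uses a few made-up descriptions, a tiny stand-in for the NLTK stop word list and a plain whitespace split instead of a proper tokenizer; it is not how `token_freq` is implemented.

```python
from collections import Counter

descriptions = ['brown clay and sand', 'weathered granite', 'sandy clay', 'clay']
stop_words = {'and', 'the', 'of'}  # tiny stand-in for the NLTK English list

# Lower-case, split on whitespace, and drop the stop words
tokens = [t for d in descriptions
          for t in d.lower().split()
          if t not in stop_words]

most_common = Counter(tokens).most_common(3)
print(most_common)
```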

## Defining lithology classes and finding primary/secondary lithologies

From the list of most common tokens, we may want to define lithology classes as follows:

In [None]:
df[LITHO_DESC_COL] = y

In [None]:
lithologies = [        'shale', 'clay','granite','soil','sand', 'porphyry','siltstone', 'dacite', 'gravel', 'limestone']
# Prep for visualisation
lithology_color_names = ['lightslategrey', 'olive', 'dimgray', 'chocolate',  'gold', 'tomato', 'teal', 'darkgrey', 'lavender', 'yellow']

In [None]:
# more classes for display of raw logs
lithologies = ['shale', 'clay','granite','soil','sand', 'porphyry','siltstone', 'dacite', 'rhyodacite', 'gravel', 'limestone', 'sandstone', 'slate', 'mudstone', 'rock', 'ignimbrite', 'tuff']
# Prep for visualisation
lithology_color_names = [
    'lightslategrey', # Shale
    'olive', # clay
    'dimgray', # granite
    'chocolate',  # soil
    'gold', # sand
    'tomato', # porphyry
    'teal', # siltstone
    'darkgrey', # dacite
    'whitesmoke', # rhyodacite
    'powderblue', # gravel 
    'yellow', #limestone
    'papayawhip', #sandstone
    'dimgray', #slate
    'darkred', #mudstone
    'grey', #rock
    'khaki', #ignimbrite
    'lemonchiffon' #tuff
]

And to capture any of these we devise a regular expression:

In [None]:
my_lithologies_numclasses = create_numeric_classes(lithologies)

In [None]:
lithologies_dict = dict([(x,x) for x in lithologies])
# Plurals do occur
lithologies_dict['clays'] = 'clay'
lithologies_dict['sands'] = 'sand'
lithologies_dict['shales'] = 'shale'


# lithologies_dict['dacite'] = 'granite'
# lithologies_dict['sandstone'] = 'granite'
# lithologies_dict['slate'] = 'granite'
# lithologies_dict['rock'] = 'granite'
# lithologies_dict['ryodacite'] = 'granite'
# lithologies_dict['mudstone'] = 'sand' # ??
lithologies_dict['topsoil'] = 'soil' # ??

In [None]:
any_litho_markers_re = r'shale|clay|granit|soil|sand|porphy|silt|gravel|dacit|slat|rock|stone|slate|brite|tuff'
regex = re.compile(any_litho_markers_re)
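
It is useful to check what this pattern catches on a handful of tokens. The snippet below uses `search`, which finds a marker anywhere in the token; whether `ela` uses `search` or `match` internally is not shown in this notebook, so treat that choice as an assumption.

```python
import re

pattern = re.compile(
    r'shale|clay|granit|soil|sand|porphy|silt|gravel|dacit|slat|rock|stone|slate|brite|tuff')

tokens = ['weathered', 'granite', 'sandy', 'limestone', 'ignimbrite', 'brown', 'bedrock']

# search() matches a marker anywhere in the token
matched = [t for t in tokens if pattern.search(t)]

# match() only matches at the start of the token, so 'bedrock' would be missed
anchored = [t for t in tokens if pattern.match(t)]

print(matched)
print(anchored)
```

Partial markers such as `stone` and `brite` are what let `limestone` and `ignimbrite` through, at the cost of also accepting any other token that contains them.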

In [None]:
lithologies_adjective_dict = {
    'sandy' :  'sand',
    'clayey' :  'clay',
    'clayish' :  'clay',
    'shaley' :  'shale',
    'silty' :  'silt',
    'pebbly' :  'pebble',
    'gravelly' :  'gravel',
    'porphyritic': 'porphyry'
}

In [None]:
v_tokens = v_word_tokenize(y)
litho_terms_detected = v_find_litho_markers(v_tokens, regex=regex)

Let's see if we detect these lithology markers in each bore log entry.

In [None]:
zero_mark = [x for x in litho_terms_detected if len(x) == 0 ]
at_least_one_mark = [x for x in litho_terms_detected if len(x) >= 1]
at_least_two_mark = [x for x in litho_terms_detected if len(x) >= 2]
print('There are %s entries with no marker, %s entries with at least one, %s with at least two'%(len(zero_mark),len(at_least_one_mark),len(at_least_two_mark)))

Note: `ela` probably needs ready-made facilities to assess the detection rate in this kind of exploratory data analysis. A word cloud may not be such a bad idea either.

In [None]:
descs_zero_mark = [y[i] for i in range(len(litho_terms_detected)) if len(litho_terms_detected[i]) == 0 ]

In [None]:
import random
random.sample(descs_zero_mark,20)
# descs_zero_mark[1:20]

In [None]:
flat = flat_list_tokens(descs_zero_mark)

In [None]:
s = ' '.join(flat)

In [None]:
show_wordcloud(s, title = 'Unclassified via regexp')

In [None]:
primary_litho = v_find_primary_lithology(litho_terms_detected, lithologies_dict)
secondary_litho = v_find_secondary_lithology(litho_terms_detected, primary_litho, lithologies_adjective_dict, lithologies_dict)
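
To give an idea of what primary and secondary lithologies mean here, the sketch below picks the first token that is a known lithology noun as the primary class, and the first other noun or adjective as the secondary class. This is a simplified, hypothetical rule (`pick_lithologies` is not an `ela` function); the actual logic of `v_find_primary_lithology` and `v_find_secondary_lithology` may well differ.

```python
def pick_lithologies(tokens, nouns, adjectives):
    """Return (primary, secondary) classes for one description; '' when none is found."""
    # Primary: first token that is a known lithology noun
    primary = next((nouns[t] for t in tokens if t in nouns), '')
    secondary = ''
    for t in tokens:
        # Secondary: first noun or adjective mapping to a class other than the primary
        candidate = nouns.get(t) or adjectives.get(t)
        if candidate and candidate != primary:
            secondary = candidate
            break
    return primary, secondary


nouns = {'clay': 'clay', 'clays': 'clay', 'sand': 'sand', 'shale': 'shale'}
adjectives = {'sandy': 'sand', 'clayey': 'clay'}

print(pick_lithologies(['sandy', 'clay'], nouns, adjectives))
print(pick_lithologies(['brown'], nouns, adjectives))
```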

In [None]:
df[PRIMARY_LITHO_COL]=primary_litho
df[SECONDARY_LITHO_COL]=secondary_litho

In [None]:
df[PRIMARY_LITHO_NUM_COL] = v_to_litho_class_num(primary_litho, my_lithologies_numclasses)
df[SECONDARY_LITHO_NUM_COL] = v_to_litho_class_num(secondary_litho, my_lithologies_numclasses)

## Converting depth below ground to Australian Height Datum elevation

While the bore entries have columns for AHD elevations, many appear to be missing data. Since we have a DEM of the region we can correct this.

In [None]:
cd = HeightDatumConverter(dem)

In [None]:
df = cd.add_height(df, 
        depth_from_col=DEPTH_FROM_COL, depth_to_col=DEPTH_TO_COL, 
        depth_from_ahd_col=DEPTH_FROM_AHD_COL, depth_to_ahd_col=DEPTH_TO_AHD_COL, 
        easting_col=EASTING_COL, northing_col=NORTHING_COL, drop_na=False)
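
The arithmetic behind this conversion is simple: the elevation of a point down the bore is the ground elevation at the bore minus the depth. The sketch below assumes that the DEM gives ground elevation in metres AHD and that depths are positive downward; `depth_to_ahd` is a hypothetical helper and the numbers are made up.

```python
def depth_to_ahd(ground_elevation_m, depth_m):
    """Elevation (m AHD) of a point depth_m metres below a ground surface at ground_elevation_m."""
    return ground_elevation_m - depth_m


ground = 580.0                       # made-up DEM value at the bore location
interval_from, interval_to = 3.0, 7.0

top_ahd = depth_to_ahd(ground, interval_from)
bottom_ahd = depth_to_ahd(ground, interval_to)

print(top_ahd, bottom_ahd)
```

Note that the "from" depth gives the higher elevation, so the top of an interval in AHD terms corresponds to its smaller depth.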

In [None]:
df.info()

In [None]:
# to be reused in experimental notebooks:
classified_logs_filename = os.path.join(cbr_datadir_out,'classified_logs.pkl')
if write_outputs or not os.path.exists(classified_logs_filename):
    df.to_pickle(classified_logs_filename)


In [None]:
# df.to_pickle(classified_logs_filename)
# df.to_csv(os.path.join(cbr_datadir_out,'classified_logs.csv'))

In [None]:
classified_logs_filename = os.path.join(cbr_datadir_out,'classified_logs.csv')
df_subset = df[[HYDRO_ID_COL, BORE_ID_COL, DEPTH_FROM_COL, DEPTH_TO_COL, LITHO_DESC_COL, 'Lithology_1', 'MajorLithCode']]
# df_subset.to_csv(classified_logs_filename)


## Interpolate over a regular grid


In [None]:
df

In [None]:
grid_res = 200
m = create_meshgrid_cartesian(x_min, x_max, y_min, y_max, grid_res)

In [None]:
dem_array = surface_array(dem, x_min, y_min, x_max, y_max, grid_res)

In [None]:
dem_array[dem_array <= 0.0] = np.nan

In [None]:
dem_array_data = {'bounds': (x_min, x_max, y_min, y_max), 'grid_res': grid_res, 'mesh_xy': m, 'dem_array': dem_array}

In [None]:
import pickle

fp = os.path.join(cbr_datadir_out, 'dem_array_data.pkl')
if write_outputs or not os.path.exists(fp):
    with open(fp, 'wb') as handle:
        pickle.dump(dem_array_data, handle, protocol=pickle.HIGHEST_PROTOCOL)

We need to define the minimum and maximum heights on the Z axis over which we interpolate. We use the KNN algorithm with 10 neighbours, so we should use a domain such that there are enough points at each height. Let's look visually for the heights with at least 10 records.

In [None]:
df.info()

In [None]:
dc = DepthCoverage(df)

In [None]:
r, cc = dc.get_counts()

In [None]:
plt.plot(r, cc)

In [None]:
r = dc.get_range(11)
r
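
The idea behind this coverage count can be sketched with numpy: for each integer elevation, count the intervals that span it, then keep the elevations where the count reaches a threshold. The data is made up and `covered_range` is a hypothetical helper; `DepthCoverage` may count things differently, for instance at interval boundaries.

```python
import numpy as np

# Made-up interval tops and bottoms, in m AHD
tops = np.array([580, 575, 560, 590])
bottoms = np.array([560, 550, 555, 570])

levels = np.arange(bottoms.min(), tops.max() + 1)
# Number of intervals covering each level (bottom inclusive, top exclusive)
counts = np.array([np.sum((bottoms <= z) & (z < tops)) for z in levels])


def covered_range(levels, counts, threshold):
    """Lowest and highest level with at least `threshold` records, or None."""
    ok = levels[counts >= threshold]
    return (int(ok.min()), int(ok.max())) if ok.size else None


print(covered_range(levels, counts, 2))
```

This simple version returns only the extremes, so it does not guarantee that every level in between also meets the threshold; plotting the counts, as done above, remains a useful check.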

In [None]:
n_neighbours=10
ahd_min=int(r[0])
ahd_max=int(r[1])

z_ahd_coords = np.arange(ahd_min,ahd_max,1)
dim_x,dim_y = m[0].shape
dim_z = len(z_ahd_coords)
dims = (dim_x,dim_y,dim_z)

In [None]:
dims

In [None]:
lithology_3d_array=np.empty(dims)

In [None]:
gi = GridInterpolation(easting_col=EASTING_COL, northing_col=NORTHING_COL)

In [None]:
gi.get_lithology_observations_for_depth(df, ahd_max, 'Depth From (AHD)')

In [None]:
len(df)

In [None]:
gi.interpolate_volume(lithology_3d_array, df, PRIMARY_LITHO_NUM_COL, z_ahd_coords, n_neighbours, m)
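
For readers new to the method, here is what K-nearest-neighbours classification does on a single horizontal slice: each grid point takes the most frequent class among its `k` closest observations. This is a bare-bones numpy sketch on made-up points with a hypothetical `knn_classify` function, not the code `ela` runs; details such as distance weighting or tie breaking are not covered.

```python
from collections import Counter

import numpy as np


def knn_classify(obs_xy, obs_class, query_xy, k):
    """Majority class among the k nearest observations, for each query point."""
    obs_xy = np.asarray(obs_xy, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)
    k = min(k, len(obs_xy))
    out = []
    for q in query_xy:
        d2 = np.sum((obs_xy - q) ** 2, axis=1)   # squared distances to all observations
        nearest = np.argsort(d2)[:k]             # indices of the k closest ones
        votes = Counter(obs_class[i] for i in nearest)
        out.append(votes.most_common(1)[0][0])
    return np.array(out)


# Two made-up clusters of bores with lithology classes 1 and 4
obs_xy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
obs_class = [1, 1, 1, 4, 4, 4]
result = knn_classify(obs_xy, obs_class, [(0.5, 0.5), (10.5, 10.5)], k=3)

print(result)
```

With fewer observations than `k` at a given height, the vote is dominated by whatever points exist, which is why the Z range was restricted above to heights with enough records.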

In [None]:
# Burn DEM into grid
z_index_for_ahd = z_index_for_ahd_functor(b=-ahd_min)

In [None]:
dem_array.shape, m[0].shape, lithology_3d_array.shape

In [None]:
burn_volume(lithology_3d_array, dem_array, z_index_for_ahd, below=False)
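
"Burning" the DEM into the volume means blanking out the cells that lie above the ground surface, since the interpolation fills the whole box. The sketch below is our interpretation on a tiny made-up grid: it assumes that the Z index is simply `floor(ahd) - ahd_min` and that `below=False` means the cells above the surface are set to `NaN`. Neither assumption is checked against the `ela` source.

```python
import numpy as np

ahd_min, ahd_max = 550, 555
z_coords = np.arange(ahd_min, ahd_max, 1)      # 550 .. 554
volume = np.ones((2, 2, len(z_coords)))        # made-up interpolated classes
dem = np.array([[552.0, 554.9],
                [549.0, np.nan]])              # made-up ground elevations


def z_index(ahd, b=-ahd_min):
    """Index along Z of the cell containing elevation `ahd` (1 m cells)."""
    return int(np.floor(ahd)) + b


for ix in range(dem.shape[0]):
    for iy in range(dem.shape[1]):
        if np.isnan(dem[ix, iy]):
            volume[ix, iy, :] = np.nan         # no surface information: blank the column
            continue
        first_above = max(z_index(dem[ix, iy]) + 1, 0)
        volume[ix, iy, first_above:] = np.nan  # blank everything above the surface

print(np.isnan(volume).sum(), 'cells blanked out of', volume.size)
```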

In [None]:
# to be reused in experimental notebooks:
interp_litho_filename = os.path.join(cbr_datadir_out,'3d_primary_litho.pkl')
if write_outputs or not os.path.exists(interp_litho_filename):
    with open(interp_litho_filename, 'wb') as handle:
        pickle.dump(lithology_3d_array, handle, protocol=pickle.HIGHEST_PROTOCOL)