# Data Exploration
This notebook gives the coach some ideas for exploring the data.

## Dependencies

Some handy libraries:
- ``geopandas``  https://geopandas.org/ to work with spatial files (GeoJSON and shapefiles)
- ``contextily`` https://contextily.readthedocs.io/ to render map tile backgrounds

In [None]:
%pip install geopandas contextily

Locations of our raw files in OneLake, either downloaded or extracted from ``resources.zip``. See the ``Solution - Data Engineering`` notebook for details.

In [None]:
import geopandas as gp

rawFilesFolder = "/lakehouse/default/Files/Raw/"

marineZonesRawFolder = f"{rawFilesFolder}BOM/IDM00003"
marineZonesRawFile = f"{marineZonesRawFolder}/IDM00003.shp"

shipwrecksRawFolder = f"{rawFilesFolder}WAM"
shipwrecksRawFile = f"{shipwrecksRawFolder}/Shipwrecks_WAM_002_WA_GDA94_Public.geojson"

## Basic Exploration
First let's load our data from Raw.

In [None]:
df_shipwrecks = gp.read_file(shipwrecksRawFile)
df_marineZones = gp.read_file(marineZonesRawFile)

Let's look at the data

In [None]:
df_shipwrecks.head()

In [None]:
df_marineZones.head()

Spatial data uses a Co-ordinate Reference System (CRS) to define how co-ordinates map to locations on the Earth. Let's check ours.

In [None]:
df_shipwrecks.crs

In [None]:
df_marineZones.crs

EPSG:4326 is WGS84 (World Geodetic System 1984), as used in GPS. EPSG:4283 is GDA94 (Geocentric Datum of Australia 1994).

Let's normalise both to EPSG:3857 (aka Web Mercator), which is commonly used by web mapping tools.

In [None]:
df_shipwrecks = df_shipwrecks.to_crs('epsg:3857')
df_marineZones = df_marineZones.to_crs('epsg:3857')
print(df_shipwrecks.crs)
print(df_marineZones.crs)
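
To give a feel for what the reprojection does, here is a minimal sketch of the spherical Web Mercator forward formula in plain Python. It is only an illustration: ``to_crs`` also handles datum differences (such as GDA94 vs WGS84), which this sketch ignores. The function name and the Perth co-ordinates are assumptions made for the example.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, the sphere radius used by EPSG:3857


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to EPSG:3857 x/y (spherical formula)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Approximate co-ordinates for Perth (illustrative values only)
x, y = lonlat_to_web_mercator(115.86, -31.95)
print(round(x), round(y))
```

Southern latitudes give a negative ``y``, and the result is in map units that are nominally metres.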

## Basic Plots
We're working with spatial data, so let's make some plots.

In [None]:
df_shipwrecks.plot()

In [None]:
df_marineZones.plot()

Let's make them a little more fancy

In [None]:
df_marineZones.plot(figsize=(10,10), alpha=0.3, edgecolor="k", column='DIST_NAME', categorical=True, legend=False)

We can overlay our ``shipwrecks``:

In [None]:
ax = df_marineZones.plot(figsize=(10,10), alpha=0.3, edgecolor="k", column='DIST_NAME', categorical=True, legend=False)
df_shipwrecks.plot(ax=ax, color='r')

We're only interested in Western Australian shipwrecks. We could clip ``marineZones`` to the ``bounds`` of ``shipwrecks``, but we saw earlier that ``marineZones`` contains a ``STATE_CODE`` column, so let's use that.

In [None]:
df_marineZones = df_marineZones[df_marineZones.STATE_CODE == "WA"]

Let's plot the filtered zones again, still coloured by marine zone ``DIST_NAME`` (district name)

In [None]:
ax = df_marineZones.plot(figsize=(10,10), alpha=0.3, edgecolor="k", column='DIST_NAME', categorical=True, legend=False)
df_shipwrecks.plot(ax=ax, color='r')

Nice.

Let's add a base layer using ``contextily``

In [None]:
import contextily as cx

ax = df_marineZones.plot(figsize=(25,25), alpha=0.3, edgecolor="k", column='DIST_NAME', categorical=True, legend=False)
df_shipwrecks.plot(ax=ax, color='r')
cx.add_basemap(ax)

We can already see that some wrecks don't fall within a marine zone (top left: Cocos Keeling Islands).

In [None]:
df_shipwrecks[df_shipwrecks.long == df_shipwrecks['long'].min()]

We need a spatial anti-join to find all the shipwrecks that fall outside a marine zone. Unfortunately, ``geopandas`` doesn't yet support anti-joins, but we can fake one with a left outer join and a filter that keeps only the rows with no match on the right side. We also clone the original right-side geometry into its own column, as it is dropped during the join.

In [None]:
#Deep copy our marineZones and duplicate the geometry
df_marineZones_tmp = df_marineZones.copy()
df_marineZones_tmp["right_geometry"] = df_marineZones_tmp["geometry"]

Now we can use ``sjoin`` to spatially join our data, then filter

In [None]:
df_joined = df_shipwrecks.sjoin(df_marineZones_tmp, how="left")
df_nozone = df_joined[df_joined["index_right"].isna()] # fake anti-join: unmatched wrecks have no right-side index
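
The left-join-then-filter pattern is easier to see on toy data. The sketch below uses a made-up square zone and two made-up points (all names and co-ordinates are hypothetical), and assumes a ``geopandas`` version that provides the ``GeoDataFrame.sjoin`` method.

```python
import geopandas as gp
from shapely.geometry import Point, Polygon

# Hypothetical toy data: one square zone and two points, one inside and one outside
zones = gp.GeoDataFrame(
    {"zone": ["A"]},
    geometry=[Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])],
    crs="EPSG:3857",
)
points = gp.GeoDataFrame(
    {"name": ["inside", "outside"]},
    geometry=[Point(5, 5), Point(20, 20)],
    crs="EPSG:3857",
)

# The left spatial join keeps every point; unmatched rows get NaN in index_right
joined = points.sjoin(zones, how="left")

# Anti-join: keep only the rows that found no zone
unmatched = joined[joined["index_right"].isna()]
print(unmatched["name"].tolist())
```

Only the point outside the square should survive the filter.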

Let's plot this. Red are wrecks with no marine zone.

In [None]:
ax = df_marineZones.plot(figsize=(25,25), alpha=0.3, edgecolor="k", column='DIST_NAME', categorical=True, legend=False)
df_joined.plot(ax=ax, color='b', alpha=0.3)
df_nozone.plot(ax=ax, color='r', alpha=1.0)
cx.add_basemap(ax)

Let's zoom in and look at an example: the SS Omeo, just off the shore in Perth (it's a great snorkel site; you can literally walk off the beach and be on the wreck).

Perth is marine zone WA_MW015, so let's filter and plot.



In [None]:
#The Omeo and Perth maritime zone
df_omeo = df_shipwrecks[df_shipwrecks.name=='Omeo']
df_perth = df_marineZones[df_marineZones.AAC == 'WA_MW015']

#Bounding box of the Omeo
xmin, ymin, xmax, ymax = df_omeo.total_bounds
# Padding: 200 EPSG:3857 units (nominally metres; somewhat less on the ground at Perth's latitude)
pad=200

#Plot
ax = df_omeo.plot(figsize=(5,5),color='r')
df_perth.plot(ax=ax)

#Now set our plot limits
ax.set_xlim(xmin-pad, xmax+pad)
ax.set_ylim(ymin-pad, ymax+pad)

#Add in our basemap
cx.add_basemap(ax)


So we can see that the marine zone doesn't follow the coastline, and the SS Omeo falls outside its geometry. We can't use a simple ``contains`` or ``intersects`` spatial join to place our wrecks in their relevant marine zone; we need a different kind of join.

``geopandas`` supports nearest joins, so we can look for the zone closest to each wreck and remove any wreck that is farther from its nearest zone than an arbitrary cutoff (5000 map units here).

In [None]:
df_joined = df_shipwrecks.sjoin_nearest(df_marineZones, how="left", distance_col="distance").query("distance < 5000")

ax = df_marineZones.plot(figsize=(10,10), alpha=0.3, edgecolor="k", column='DIST_NAME', categorical=True, legend=False)
df_joined.plot(ax=ax, color='r', alpha=0.5)
cx.add_basemap(ax)
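
One caveat about the cutoff: distances measured in EPSG:3857 are stretched relative to true ground distance, by roughly ``1 / cos(latitude)`` under the spherical approximation. The sketch below estimates what 5000 map units means on the ground near Perth; the latitude is an approximate value chosen for illustration and the helper name is hypothetical.

```python
import math


def web_mercator_scale_factor(lat_deg):
    """Approximate ratio of EPSG:3857 map distance to ground distance at a latitude."""
    return 1.0 / math.cos(math.radians(lat_deg))


perth_lat = -31.95  # approximate latitude of Perth, for illustration only
k = web_mercator_scale_factor(perth_lat)

threshold_map_units = 5000
threshold_ground_m = threshold_map_units / k
print(f"scale factor ~{k:.3f}; 5000 map units is ~{threshold_ground_m:.0f} m on the ground")
```

If a true metric cutoff matters, one option is to reproject to a local projected CRS before measuring distances.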

Let's make sure all our wrecks have a zone, as we'll be using it as a foreign key to Forecasts when we come to build our reports.

In [None]:
df_joined[df_joined['AAC'].isnull()]
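
Note that the ``distance < 5000`` filter drops distant wrecks instead of leaving them with a null zone, so the null check alone may come back empty even when wrecks were discarded. A minimal sketch on made-up data (all names and co-ordinates are hypothetical) shows one way to also list what was dropped:

```python
import geopandas as gp
from shapely.geometry import Point, Polygon

# Hypothetical toy data: one 10 km square zone, one wreck nearby and one far away
zones = gp.GeoDataFrame(
    {"AAC": ["Z1"]},
    geometry=[Polygon([(0, 0), (0, 10000), (10000, 10000), (10000, 0)])],
    crs="EPSG:3857",
)
wrecks = gp.GeoDataFrame(
    {"name": ["near", "far"]},
    geometry=[Point(10100, 5000), Point(50000, 5000)],
    crs="EPSG:3857",
)

# Nearest join, then the same distance filter used in the notebook
nearest = wrecks.sjoin_nearest(zones, how="left", distance_col="distance")
kept = nearest.query("distance < 5000")

# Wrecks whose index no longer appears after filtering were silently dropped
dropped = wrecks[~wrecks.index.isin(kept.index)]

print(kept["AAC"].isnull().sum(), dropped["name"].tolist())
```

Comparing the row count before and after the filter gives the same information at a glance.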


We can now apply our newfound knowledge to load, clean and write our data. See the ``Solution - Data Engineering`` notebook for details.