# Geospatial Indexing

- Creators: Rudy Klucik<sup>1</sup>, Veronica Martinez<sup>1</sup>, Charles Anderson<sup>1</sup>, and Carrie Wall<sup>1</sup>
- Affiliations: <sup>1</sup>Cooperative Institute for Research in Environmental Sciences ([CIRES](https://cires.colorado.edu/))
- History:
    - Version 1, 2024-09-27

## Overview
This notebook demonstrates a workflow to read navigation data (timestamps, latitude, and longitude) from water column sonar data in Zarr format and convert it to GeoJSON for mapping and GIS. The GeoJSON linestring defines the path of the ship as the Sv data was recorded.

## Definitions
- Sv: water column sonar volume backscattering strength (dB re 1 m<sup>-1</sup>)
- [NMEA](https://en.wikipedia.org/wiki/NMEA_0183): a data specification for communication between marine electronics including echo sounders, sonars, and GPS
- Datagram: the binary storage of data inside a file

## Prerequisites
To successfully navigate and use this notebook, you should be familiar with:

- the basics of Python programming such as loading modules, assigning variables, and list/array indexing
- plotting data
- using `xarray` to access the data

## Learning Outcomes
By working through this notebook, you will learn how to:

- extract navigation data from a Zarr store
- use GeoPandas to organize geospatial data into GeoJSON format
- plot geospatial data in a mapping interface

## Time Estimate
- Estimated text reading time: 4 to 8 min
- Estimated code reading time: 2 to 4 min
- Estimated total reading time: 6 to 12 min

## Software
This tutorial uses the Python programming language and packages. We will use:

- `Boto3` to access data from an S3 bucket
- `Zarr` to work with cloud native files
- `Xarray` to work with Zarr files and for data analysis
- `Numpy` for simple array operations
- `Pandas` for creating a dataframe
- `GeoPandas` for creating a geospatial dataframe
- [`Folium`](https://geopandas.org/en/stable/gallery/plotting_with_folium.html) to plot data

## Installing and importing libraries

We will first need to install a few Python libraries for accessing and processing data. The following code checks whether the libraries are already installed in your environment and installs any that are missing.

In [None]:
# sys provides access to system level information (i.e., the executable for the Python installation)
import sys
# subprocess makes system level calls
import subprocess

# List of packages used in this notebook
PACKAGES = ["numpy", "pandas", "geopandas", "matplotlib", "xarray", "zarr", "folium", "boto3", "s3fs"]

# Loop through each package to either import or install
for package in PACKAGES:
    try:
        # First, attempt to import the package
        __import__(package)
    except ImportError:
        # If package import is unsuccessful, install using pip
        # The command structure is <python executable> -m pip install <package>
        if package == 'geopandas': package = 'geopandas==0.13.1'
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
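
The check above imports each package to see whether it is present. As a minimal sketch of an alternative, assuming only the standard library, `importlib.util.find_spec` can report whether a top-level module is installed without importing it. The helper name `missing_packages` and the placeholder package name are hypothetical and used only for illustration.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is a made-up placeholder.
print(missing_packages(["json", "a_package_that_does_not_exist_123"]))
```

The returned list could then be passed to the same `pip install` call used in the loop above.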

In [None]:
import s3fs
import folium
import os
import zarr
import boto3
import numpy as np
import pandas as pd
import xarray as xr
import geopandas as gpd
from botocore import UNSIGNED
from botocore.config import Config

import matplotlib.pyplot as plt

## Access data

Data is freely available from the NCEI archives and can be accessed from an AWS S3 bucket. Use boto3 to download a data file.

`s3fs` can be used to create a pythonic filesystem interface for S3 for easier navigation of the bucket.

In [None]:
# Connect to S3 bucket. Use UNSIGNED to connect as an anonymous user
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

In [None]:
# raw_file = "data/raw/Henry_B._Bigelow/HB0707/EK60/D20070712-T201647.raw"

In [None]:
s3_file_system = s3fs.S3FileSystem(anon=True)

In [None]:
bucket_name = 'noaa-wcsd-zarr-pds'
ship_name = 'Bell_M._Shimada'
cruise_name = 'SH1507'
sensor_name = 'EK60'
zarr_store = 'SH1507.zarr'
s3_zarr_store_path = f"{bucket_name}/level_2/{ship_name}/{cruise_name}/{sensor_name}/{zarr_store}"
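
The path above follows a `level_2/<ship>/<cruise>/<sensor>/<store>` layout. The sketch below wraps that pattern in two small helpers and builds the public HTTPS address listed in the Data Statement of this notebook. The function names are hypothetical, and the sketch assumes that the store is named after the cruise and that the bucket lives in `us-east-1`, which holds for this example but may not hold for every cruise.

```python
def zarr_store_key(ship_name, cruise_name, sensor_name, level="level_2"):
    """Build the object-key prefix of a cruise-level Zarr store.

    Assumes the store is named "<cruise_name>.zarr", as in this notebook.
    """
    return f"{level}/{ship_name}/{cruise_name}/{sensor_name}/{cruise_name}.zarr"


def public_https_url(bucket_name, key, region="us-east-1"):
    """Build the virtual-hosted-style HTTPS URL for a key in a public bucket."""
    return f"https://{bucket_name}.s3.{region}.amazonaws.com/{key}/"


key = zarr_store_key("Bell_M._Shimada", "SH1507", "EK60")
url = public_https_url("noaa-wcsd-zarr-pds", key)
print(url)
```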

In [None]:
store = s3fs.S3Map(root=s3_zarr_store_path, s3=s3_file_system, check=False)

In [None]:
cruise = xr.open_zarr(store=store, consolidated=None)
cruise

## Extract spatial and temporal data from the Zarr store
The GPS coordinates and timestamps can be accessed from the `cruise` dataset. `time` is a coordinate for the underlying data, while `latitude` and `longitude` are data variables that share the same dimension. Essentially, for every vertical measurement of water column sonar data there is an associated timestamp, latitude, and longitude. Each variable can be accessed by name as an Xarray DataArray, e.g. `cruise.time`, or via `cruise.time.values` to access just the underlying array.

The time DataArray values can be accessed by the coordinate name, 'time' as follows:

In [None]:
cruise.time

The latitude DataArray values can be accessed by name as follows:

In [None]:
cruise.latitude

Longitude values similarly:

In [None]:
cruise.longitude
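
Because the three variables share one dimension, the values at a given position belong to the same measurement. The sketch below illustrates that pairing with plain Python lists and drops records whose position is missing. All values are made up for illustration and are not taken from the SH1507 cruise.

```python
import math

# Made-up values for illustration only.
times = ["2015-07-01T00:00:00", "2015-07-01T00:00:01", "2015-07-01T00:00:02"]
latitudes = [44.60, float("nan"), 44.61]
longitudes = [-124.10, float("nan"), -124.11]

# Pair values by position, then skip records with a missing coordinate.
records = [
    {"time": t, "latitude": lat, "longitude": lon}
    for t, lat, lon in zip(times, latitudes, longitudes)
    if not (math.isnan(lat) or math.isnan(lon))
]

print(len(records))
```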

These values will be combined into a dataframe, `gps_df`, and written to GeoJSON, a format for encoding a variety of geographic data structures.

### Creating a GeoPandas DataFrame

The latitude, longitude, and timestamps can be combined to create a GeoPandas dataframe as follows:

In [None]:
# Start by creating a pandas dataframe containing lat, lon, and time for the cruise
gps_df = pd.DataFrame({'latitude': cruise.latitude.values, 'longitude': cruise.longitude.values, 'time': cruise.time.values}).set_index(['time'])

gps_df

Create a GeoPandas GeoDataFrame indexed by time (rows with missing values are dropped). GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types.

In [None]:
gps_gdf = gpd.GeoDataFrame(
    gps_df,
    geometry=gpd.points_from_xy(gps_df['longitude'], gps_df['latitude']),
    crs="epsg:4326"
).dropna()

gps_gdf

The GeoPandas dataframe can be converted to a GeoJSON format for serialization:

In [None]:
geojson = gps_gdf.to_json()
geojson
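
For readers unfamiliar with the format, the sketch below builds a small GeoJSON FeatureCollection of points by hand with the standard `json` module. It shows the general shape of such a document and is not the exact output of `to_json()`, which may include additional fields. The records and the helper name `to_feature_collection` are hypothetical.

```python
import json

# Hypothetical records; the values are illustrative only.
records = [
    {"time": "2015-07-01T00:00:00", "latitude": 44.60, "longitude": -124.10},
    {"time": "2015-07-01T00:00:02", "latitude": 44.61, "longitude": -124.11},
]


def to_feature_collection(records):
    """Wrap each record in a GeoJSON Point Feature."""
    features = [
        {
            "type": "Feature",
            "properties": {"time": r["time"]},
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [r["longitude"], r["latitude"]],
            },
        }
        for r in records
    ]
    return {"type": "FeatureCollection", "features": features}


collection = to_feature_collection(records)
print(json.dumps(collection, indent=2))
```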

### Plotting the Trackline

Next, we create a geospatial plot of the trackline using GeoPandas' built-in plotting function.

In [None]:
gps_gdf.plot()

Note the uncertainty associated with GPS measurements as the ship moves along its path.
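
One rough way to look at that variability, sketched here under simplifying assumptions, is to compute the great-circle distance between consecutive fixes with the haversine formula and look for unusually large steps. The Earth is treated as a sphere with a mean radius of 6,371 km, and the fixes below are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Made-up (latitude, longitude) fixes; the third one jumps away from the other two.
fixes = [(44.6000, -124.1000), (44.6001, -124.1001), (44.6500, -124.1002)]

# Distance covered between each pair of consecutive fixes.
steps = [haversine_m(*a, *b) for a, b in zip(fixes, fixes[1:])]
print([round(step, 1) for step in steps])
```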

### Converting Points to a Simplified Linestring

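The points can be explored on an interactive map with `gps_gdf.explore()`. The time index is first cast to strings so that it can be serialized for the map, and `mapclassify` is installed because `explore()` depends on it.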

In [None]:
gps_gdf.index = gps_gdf.index.astype(str)

In [None]:
%%capture
!pip install mapclassify
import mapclassify as mc

In [None]:
gps_gdf.explore()

### Creating a GeoJSON Linestring

By combining all the latitude and longitude values, we can create a linestring geometry.

In [None]:
import shapely.geometry as geom

In [None]:
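# Pairs are (latitude, longitude), the order folium.PolyLine expects; GeoJSON itself uses (longitude, latitude)
linestring = geom.LineString(list(zip(gps_gdf.latitude, gps_gdf.longitude)))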

In [None]:
#linestring = geom.LineString(gps_gdf['geometry'])

In [None]:
len(linestring.coords)

In [None]:
linestring.coords[0]
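
The linestring above stores its pairs as (latitude, longitude), whereas GeoJSON orders positions as [longitude, latitude]. The sketch below, using only the standard `json` module, shows how such pairs could be swapped and wrapped in a GeoJSON LineString Feature. The function name and the track values are hypothetical.

```python
import json


def latlon_to_geojson_linestring(latlon_pairs):
    """Build a GeoJSON LineString Feature from (latitude, longitude) pairs."""
    # GeoJSON expects [longitude, latitude], so each pair is swapped.
    coordinates = [[lon, lat] for lat, lon in latlon_pairs]
    return {
        "type": "Feature",
        "properties": {},
        "geometry": {"type": "LineString", "coordinates": coordinates},
    }


# Made-up (latitude, longitude) track for illustration.
track = [(44.60, -124.10), (44.61, -124.11), (44.62, -124.13)]
feature = latlon_to_geojson_linestring(track)
geojson_text = json.dumps(feature)
print(geojson_text)
```

The same function could be applied to `linestring.coords` or to the coordinates of a simplified linestring before writing the result to a `.geojson` file.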

### Simplification

The geometry contains many points describing the path of the ship's journey. It can be simplified using the [Ramer–Douglas–Peucker algorithm](https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm). Given a single tolerance value, the algorithm finds a geometry with fewer points that follows approximately the same path. Here the tolerance is set to `0.01`, expressed in the units of the coordinates (decimal degrees).

In [None]:
lineSimplified = linestring.simplify(tolerance=0.01, preserve_topology=True)

Note how the total number of coordinates needed to specify the linestring drops significantly based on the tolerance value.
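
To make the role of the tolerance concrete, here is a minimal pure-Python sketch of the Ramer-Douglas-Peucker idea on planar coordinates. It is for illustration only: Shapely delegates simplification to the GEOS library, and its results may differ in detail, particularly with `preserve_topology=True`. The function names and the sample path are hypothetical, and the recursive form shown here is not tuned for very long tracks.

```python
import math


def perpendicular_distance(point, start, end):
    """Planar distance from point to the line through start and end."""
    (x, y), (x1, y1), (x2, y2) = point, start, end
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0:
        # start and end coincide, so fall back to point-to-point distance
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / length


def rdp(points, tolerance):
    """Simplify a polyline, keeping only points farther than tolerance from the chord."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord joining the endpoints.
    distances = [perpendicular_distance(p, points[0], points[-1]) for p in points[1:-1]]
    max_distance = max(distances)
    index = distances.index(max_distance) + 1
    if max_distance <= tolerance:
        return [points[0], points[-1]]
    # Keep that point and simplify the two halves recursively.
    left = rdp(points[: index + 1], tolerance)
    right = rdp(points[index:], tolerance)
    return left[:-1] + right


path = [(0, 0), (1, 0.001), (2, 0), (3, 1), (4, 0)]
print(rdp(path, tolerance=0.01))  # → [(0, 0), (2, 0), (3, 1), (4, 0)]
```

The point `(1, 0.001)` is dropped because it lies within the tolerance of the segment around it, while the sharp turn at `(3, 1)` is kept. A larger tolerance removes more points at the cost of a coarser path.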

In [None]:
print(f"Total number of points for original linestring: {len(linestring.coords)}")

In [None]:
print(f"Total number of points needed for the simplified linestring: {len(lineSimplified.coords)}")

In [None]:
from IPython.display import display

### Plotting the data interactively

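We start by choosing a center for the map. Because each geometry in `gps_gdf` is a single point, its centroid is the point itself, so the call below yields one centroid per GPS fix rather than a single center for the whole cruise.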

In [None]:
centroid = gps_gdf.geometry.centroid

Next, extract the coordinates (latitude, longitude) of the first centroid, which corresponds to the first recorded position of the cruise.

In [None]:
center_latitude = centroid.y.iloc[0]
center_longitude = centroid.x.iloc[0]
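
Another simple way to estimate a center for the whole track, sketched here without GeoPandas, is the midpoint of the track's bounding box. The function name and sample coordinates are hypothetical, and the sketch assumes the track does not cross the antimeridian (±180° longitude).

```python
def bounds_and_center(latlon_pairs):
    """Return ((south, west, north, east), (center_lat, center_lon)) for a track."""
    lats = [lat for lat, _ in latlon_pairs]
    lons = [lon for _, lon in latlon_pairs]
    south, north = min(lats), max(lats)
    west, east = min(lons), max(lons)
    center = ((south + north) / 2, (west + east) / 2)
    return (south, west, north, east), center


# Made-up (latitude, longitude) track for illustration.
track = [(44.0, -125.0), (46.0, -124.0), (45.0, -126.0)]
bounds, center = bounds_and_center(track)
print(bounds, center)
```

The resulting corners can also be arranged as `[[south, west], [north, east]]`, the form that folium's `Map.fit_bounds` accepts, so the map can zoom to the full track instead of relying on a fixed `zoom_start`.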

Finally, plot the simplified PolyLine of the ship's movement over the duration of the cruise.

In [None]:
m = folium.Map(location=[center_latitude, center_longitude], zoom_start=10)

folium.PolyLine(lineSimplified.coords).add_to(m)

In [None]:
display(m)

# Data Statement
All data used in this notebook are publicly available.

The Level 2 data can be found here:

*   https://noaa-wcsd-zarr-pds.s3.us-east-1.amazonaws.com/level_2/Bell_M._Shimada/SH1507/EK60/SH1507.zarr/


The files can be explored by navigating to the following AWS file explorer:

*   https://noaa-wcsd-zarr-pds.s3.amazonaws.com/index.html#level_2/Bell_M._Shimada/SH1507/EK60/SH1507.zarr/


# Acknowledgments
Funding support was provided by the NOAA Center for Artificial Intelligence ([NCAI](https://www.noaa.gov/noaa-center-for-artificial-intelligence/)) and [NOAA Fisheries](https://www.fisheries.noaa.gov/).

# Metadata
- Language / package(s):
 - Language: Python
 - Packages: Boto3, s3fs, Xarray, Zarr, Numpy, Pandas, GeoPandas, Folium

- Scientific domain:
 - Fisheries acoustics

- Application keywords:
 - Sonar processing

- Geophysical keywords:
 - Spatial location

# License
## Software and Content Description License
Software code created by U.S. Government employees is not subject to copyright in the United States (17 U.S.C. §105). The United States/Department of Commerce reserve all rights to seek and obtain copyright protection in countries other than the United States for Software authored in its entirety by the Department of Commerce. To this end, the Department of Commerce hereby grants to Recipient a royalty-free, nonexclusive license to use, copy, and create derivative works of the Software outside of the United States.

# Disclaimer
This Jupyter notebook is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA Jupyter notebooks are provided on an 'as is' basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this Jupyter notebook will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.