## Add an OTN project to the IPT as an Occurrence/Event/eMoF

This script and its supporting library take an OTN project schema and map its metadata and data to Darwin Core Event Core. Given credentials for an IPT (Integrated Publishing Toolkit), it will attempt to update the IPT entry for that schema.

In [93]:
%load_ext autoreload
%autoreload 2

In [94]:
from dbtools.load_styles import load_pygment_style
from dbtools.connect_db import get_engine, test_engine_connection
from dbtools import publish_to_obis as pub

load_pygment_style('solarized-light')

In [95]:
# a little network tunnelling magic for connecting to OTN db
engine = get_engine('../auth/otnunit_authfile.auth')
test_engine_connection(engine)

[1m[32mDatabase connection established[0m
Connection Type:[1m[34mpostgresql[0m Host:[1m[34mdb.load.oceantrack.org[0m Database:[1m[34motnunit[0m User:[1m[34mpye[0m Node:[1m[34mOTN[0m


True

## Create Event Core Fileset - Tags and Receivers Projects

For projects that deploy their own receivers, we can create an Event Core fileset that shows the project's receiver deployment effort as well as the detections of its tagged animals.

Receivers from other projects are also included in the receiver event list, but only those receivers that have detected this project's animals.

More details about the strategy for assembling the Event Core fileset, as well as the contents of the fields themselves, can be found at https://github.com/tdwg/dwc-for-biologging
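
The receiver selection rule described above can be sketched in plain Python. This is only an illustration under stated assumptions: the record layout, the field names (`deployment_id`, `project`, `animal_project`), the `OTHER` project code and the helper `receiver_events_for` are all hypothetical, and this is not how `pub.create_event_core_fileset` is actually implemented.

```python
# Hypothetical, simplified records; real OTN tables carry many more fields.
deployments = [
    {"deployment_id": "NCAT-R1", "project": "NCAT"},
    {"deployment_id": "NCAT-R2", "project": "NCAT"},
    {"deployment_id": "OTHER-R1", "project": "OTHER"},
    {"deployment_id": "OTHER-R2", "project": "OTHER"},
]
detections = [
    {"animal_project": "NCAT", "deployment_id": "NCAT-R1"},
    {"animal_project": "NCAT", "deployment_id": "OTHER-R1"},
    {"animal_project": "OTHER", "deployment_id": "OTHER-R2"},
]


def receiver_events_for(project, deployments, detections):
    """Return deployment IDs that belong in the project's receiver event list."""
    # Deployments (from any project) that detected this project's animals.
    detected_on = {
        d["deployment_id"] for d in detections if d["animal_project"] == project
    }
    # Keep the project's own deployments, plus outside deployments with detections.
    return [
        dep["deployment_id"]
        for dep in deployments
        if dep["project"] == project or dep["deployment_id"] in detected_on
    ]


selected = receiver_events_for("NCAT", deployments, detections)
print(selected)
```

In this toy data, `OTHER-R2` is left out because it never detected an NCAT animal, while `NCAT-R2` stays in as part of the project's own deployment effort even though it has no detections.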


In [96]:
# Make an Event Core fileset for a project
event_core_fileset = pub.create_event_core_fileset(code='NCAT', engine=engine, write_files=True)

Getting acoustic occurrences for capture data...OK
Grabbing all detections
    otn_detections_2007...OK
    otn_detections_2008...OK
    otn_detections_2009...OK
    otn_detections_2010...OK
    otn_detections_2011...OK
    otn_detections_2012...OK
    otn_detections_2013...OK
    otn_detections_2014...OK
    otn_detections_2015...OK
    otn_detections_2016...OK
    otn_detections_2017...OK
    otn_detections_2018...OK
    otn_detections_2019...OK
    otn_detections_2020...OK
    otn_detections_2021...OK
    otn_detections_2022...OK
    otn_detections_2023...OK
    otn_detections_early...OK
OK: Done grabbing detections
Grabbing detection collectioncodes...OK
Filtering using the resonate filter tool...
Total detections in filtered dataframe: 335625
736 suspect detections removed
OK: Done filtering
Running decimated occurrences:
Running the ETN decimation on the filtered detections...OK



DataFrame columns are not unique, some columns will be omitted.


DataFrame columns are not unique, some columns will be omitted.



EML metadata has been written to '/home/jdpye/jupyter_files/ipy-utils-obis-dev/obisoutput/otnotnnortherncodacoustic/eml.xml'.
Creating fileset for DwC archive from the following files:
Zipped file /home/jdpye/jupyter_files/ipy-utils-obis-dev/obisoutput/otnotnnortherncodacoustic/2023-11-08-18-19-57/meta.xml
Zipped file /home/jdpye/jupyter_files/ipy-utils-obis-dev/obisoutput/otnotnnortherncodacoustic/eml.xml
Zipped file obisoutput/otnotnnortherncodacoustic/2023-11-08-18-19-57/events.csv
Zipped file obisoutput/otnotnnortherncodacoustic/2023-11-08-18-19-57/occurrences.csv
Zipped file obisoutput/otnotnnortherncodacoustic/2023-11-08-18-19-57/emofs.csv


### Explore and validate the DwC archive we just made


In [97]:
# ! pip install python-dwca-reader

from dwca.read import DwCAReader
from dwca.darwincore.utils import qualname as qn
import plotly.express as px  # ok plotly do your thing.
import ipywidgets
# mapbox_token = 

# TODO: pass the file path along from the creation step.
with DwCAReader('obisoutput/otnotnnortherncodacoustic/otnotnnortherncodacoustic_source_data.zip') as dwca:
    dwca.metadata
    print("Core type is %s" % dwca.descriptor.core.type)
    for e in dwca.descriptor.extensions:
        print('Extension %s present' % e.type)
        
    core_df = dwca.pd_read(dwca.descriptor.core.file_location)
    
    print("Core dataframe has %s rows and %s columns" % core_df.shape)
    
    for e in dwca.descriptor.extensions:
        ext_df = dwca.pd_read(e.file_location, low_memory=False) # large files with mixed-type columns :/ 
        print("Extension %s has %s rows and %s columns" % (e.type, ext_df.shape[0], ext_df.shape[1]))
        if e.type == "http://rs.tdwg.org/dwc/terms/Occurrence":
            occ_df = ext_df

Core type is http://rs.tdwg.org/dwc/terms/Event
Extension http://rs.tdwg.org/dwc/terms/Occurrence present
Extension http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact present
Core dataframe has 27587 rows and 11 columns
Extension http://rs.tdwg.org/dwc/terms/Occurrence has 28126 rows and 30 columns
Extension http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact has 1548 rows and 10 columns
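
One validation step that could follow the row counts above is a referential integrity check: every extension row should point at an event that exists in the core. The sketch below uses toy dataframes rather than the archive itself, and it assumes both frames carry an `eventID` column (the later merge on `eventID` suggests they do). The helper name `orphan_extension_rows` is made up for illustration.

```python
import pandas as pd


def orphan_extension_rows(core_df, ext_df, key="eventID"):
    """Return extension rows whose key value has no matching core row."""
    return ext_df[~ext_df[key].isin(core_df[key])]


# Toy stand-ins for core_df and occ_df; the IDs are made up.
toy_core = pd.DataFrame({"eventID": ["e1", "e2"]})
toy_occ = pd.DataFrame(
    {"eventID": ["e1", "e2", "e3"], "occurrenceID": ["o1", "o2", "o3"]}
)

orphans = orphan_extension_rows(toy_core, toy_occ)
print("Core eventIDs unique:", toy_core["eventID"].is_unique)
print("Orphaned occurrences:", orphans["occurrenceID"].tolist())
```

Against the real archive, the same call would be `orphan_extension_rows(core_df, occ_df)`; an empty result means no occurrence is left without a parent event.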


In [83]:
# Get spatial and temporal extents.
# don't need multiple points at the same locations, drop duplicates
geo_df = core_df[['decimalLatitude', 'decimalLongitude']].drop_duplicates()
print("Reduced %s occurrences to %s distinct locations for the purposes of plotting geographic scope" % (len(core_df), len(geo_df)))
print("Project ranges from %s to %s" % (core_df.eventDate.min(), core_df.eventDate.max()))
fig = px.scatter_geo(geo_df, lat='decimalLatitude', lon='decimalLongitude')
fig.show()
    

Reduced 48540 occurrences to 263 distinct locations for the purposes of plotting geographic scope
Project ranges from 2009-10-09 00:46:25 to 2014-08-30 08:32:29


In [98]:
# Lonboard go!

# Tentative plan:
# Make a set of line geometry objects per individual fish 
# and plot using lonboard.
import geopandas as gpd
import numpy as np
import pandas as pd
import shapely
# from palettable.colorbrewer.diverging import BrBG_10

from lonboard import viz, Map, ScatterplotLayer
import ipywidgets
from lonboard.colormap import apply_continuous_cmap

In [99]:
# if it's Event Core, merge the event/occurrence dataframes
core_type = dwca.descriptor.core.type
if core_type == "http://rs.tdwg.org/dwc/terms/Event":
    print("Combining Event Core location data with Occurrence extension data")
    all_df = pd.merge(core_df, occ_df, on='eventID')
elif core_type == "http://rs.tdwg.org/dwc/terms/Occurrence":
    print("Core is Occurrence, skipping join step.")
    all_df = core_df
else:
    # fail early rather than hit a NameError on all_df in a later cell
    raise ValueError("Unexpected core type: %s" % core_type)

Combining Event Core location data with Occurrence extension data
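
Note that `pd.merge` defaults to an inner join, so events with no occurrence rows drop out of the merged frame. A minimal sketch of how to make that visible, using toy data with made-up IDs and coordinates: `validate="one_to_many"` raises if an `eventID` repeats on the event side, and `indicator=True` adds a `_merge` column that flags unmatched events.

```python
import pandas as pd

# Toy stand-ins for core_df (events) and occ_df (occurrences).
toy_events = pd.DataFrame(
    {
        "eventID": ["e1", "e2", "e3"],
        "decimalLatitude": [48.71, 48.44, 47.54],
        "decimalLongitude": [-53.14, -53.00, -52.59],
    }
)
toy_occ = pd.DataFrame(
    {"eventID": ["e1", "e1", "e2"], "organismID": ["fish-A", "fish-B", "fish-A"]}
)

# A left join keeps every event; _merge records whether a match was found.
merged = pd.merge(
    toy_events,
    toy_occ,
    on="eventID",
    how="left",
    validate="one_to_many",  # each eventID may appear only once in toy_events
    indicator=True,
)

events_without_occurrences = merged.loc[
    merged["_merge"] == "left_only", "eventID"
].tolist()
print(len(merged), "rows after the join")
print("Events without occurrences:", events_without_occurrences)
```

Whether dropping those events is desirable depends on the plot: for mapping detections the inner join is fine, but for showing total receiver effort the unmatched events matter.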


In [100]:
all_gdf = gpd.GeoDataFrame(all_df, geometry=gpd.points_from_xy(all_df.decimalLongitude, all_df.decimalLatitude), crs=4326)

In [101]:
# What do the individual points look like, maybe scale the radius by detection counts?

map_ = viz(all_gdf, radius_min_pixels=3)

In [102]:
map_

Map(layers=[ScatterplotLayer(radius_min_pixels=3.0, table=pyarrow.Table
id: string
decimalLatitude: double
dec…

In [103]:
# make a LineString for each individual from the earliest to last date of occurrence
# order by individual and date, then group by?
# or do we make a linestring based on the existing gdf
from shapely.geometry import LineString

# first, drop organisms with only a single row from the gdf; a LineString needs at least two points.
no_single_dets = all_gdf['organismID'].value_counts() > 1
filtered_gdf = all_gdf[all_gdf['organismID'].isin(no_single_dets[no_single_dets].index)]

# then make lines
line_gdf = filtered_gdf.sort_values(by=['organismID','eventDate']).groupby('organismID')['geometry'].apply(lambda x: LineString(x.tolist())).reset_index()

In [106]:
line_gdf

| | organismID | geometry |
|---|---|---|
| 0 | NCAT-1282890-2019-07-09 | LINESTRING (-53.13870 48.70990, -53.06620 48.7... |
| 1 | NCAT-1282891-2019-07-09 | LINESTRING (-53.06620 48.70920, -53.13870 48.7... |
| 2 | NCAT-1282892-2019-07-09 | LINESTRING (-53.13870 48.70990, -53.13870 48.7... |
| 3 | NCAT-1282894-2019-07-08 | LINESTRING (-53.06620 48.70920, -53.06620 48.7... |
| 4 | NCAT-1282895-2019-07-08 | LINESTRING (-52.99760 48.43530, -52.99760 48.4... |
| ... | ... | ... |
| 343 | NCAT-1287713-2019-07-23 | LINESTRING (-52.58617 47.53633, -52.58617 47.5... |
| 344 | NCAT-1287714-2019-07-23 | LINESTRING (-52.09991 53.66607, -52.13978 53.0... |
| 345 | NCAT-1287715-2019-07-23 | LINESTRING (-52.77600 48.16470, -52.77600 48.1... |
| 346 | NCAT-1287716-2019-07-23 | LINESTRING (-52.63290 47.46870, -52.63290 47.4... |
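
The track-building logic (order by individual and date, drop individuals with a single position, join the rest into a line) can also be sketched without geopandas or shapely. The version below is a dependency-free stand-in that emits WKT strings instead of `LineString` objects; the rows, the short `lon`/`lat` field names and the helper `tracks_as_wkt` are hypothetical. It relies on `eventDate` strings in `YYYY-MM-DD HH:MM:SS` form, which sort chronologically as plain text.

```python
from collections import defaultdict

# Made-up positions for two individuals; fish-B has only one row.
rows = [
    {"organismID": "fish-A", "eventDate": "2019-07-10 00:00:00", "lon": -53.07, "lat": 48.71},
    {"organismID": "fish-A", "eventDate": "2019-07-09 00:00:00", "lon": -53.14, "lat": 48.71},
    {"organismID": "fish-B", "eventDate": "2019-07-09 00:00:00", "lon": -52.99, "lat": 48.44},
]


def tracks_as_wkt(rows):
    """Build one WKT LINESTRING per organism, ordered by eventDate."""
    by_organism = defaultdict(list)
    for row in rows:
        by_organism[row["organismID"]].append(row)

    tracks = {}
    for organism, points in by_organism.items():
        if len(points) < 2:
            continue  # a line needs at least two positions
        points.sort(key=lambda r: r["eventDate"])
        coords = ", ".join(f"{p['lon']} {p['lat']}" for p in points)
        tracks[organism] = f"LINESTRING ({coords})"
    return tracks


tracks = tracks_as_wkt(rows)
print(tracks)
```

Sorting before joining is what makes each line follow the animal's movement in time; without it, the path would connect positions in whatever order the rows happen to arrive.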


In [104]:
from lonboard import Map, PathLayer
layer = PathLayer.from_geopandas(line_gdf, auto_downcast=False, width_min_pixels=1)
map_ = Map(layers=[layer])
map_

Map(layers=[PathLayer(table=pyarrow.Table
organismID: string
geometry: list<item: fixed_size_list<item: double…

In [107]:
## Isolate an individual

from lonboard import Map, PathLayer
indiv='NCAT-1287714-2019-07-23'
layer = PathLayer.from_geopandas(line_gdf[line_gdf['organismID']==indiv], auto_downcast=False, width_min_pixels=1)
map_ = Map(layers=[layer])
print(all_gdf[((all_gdf['organismID']==indiv) & (all_gdf['basisOfRecord']=='HumanObservation'))])
map_

Empty GeoDataFrame
Columns: [id, decimalLatitude, decimalLongitude, eventID, footprintWKT, eventDate, geodeticDatum, fieldNotes, parentEventID, maximumDepthInMeters, coordinateUncertaintyInMeters, coreid, license, rightsHolder, accessRights, institutionID, datasetID, institutionCode, collectionCode, datasetName, basisOfRecord, occurrenceID, catalogNumber, recordedBy, individualCount, sex, lifeStage, occurrenceStatus, occurrenceRemarks, organismID, organismName, organismScope, startDayOfYear, endDayOfYear, samplingProtocol, locality, scientificNameID, scientificName, kingdom, vernacularName, geometry]
Index: []

[0 rows x 41 columns]


Map(layers=[PathLayer(table=pyarrow.Table
organismID: string
__index_level_0__: int64
geometry: list<item: fix…


### Close the engine

In [None]:
# Drop the database connection afterward
engine.dispose()

In [None]:
# TODO: Add a Frictionless Data schema

In [78]:
# Look at the ATN proposed data format

atn_df = pd.read_csv('../input/atn_45866_occurrence.csv')
atn_gdf = gpd.GeoDataFrame(atn_df, geometry=gpd.points_from_xy(atn_df.decimalLongitude, atn_df.decimalLatitude), crs=4326)

In [58]:
map_ = viz(atn_gdf, radius_min_pixels=3)

In [59]:
map_

Map(layers=[ScatterplotLayer(radius_min_pixels=3.0, table=pyarrow.Table
basisOfRecord: string
organismID: stri…

In [79]:
# make a LineString for each individual from the earliest to last date of occurrence
# order by individual and date, then group by?
# or do we make a linestring based on the existing gdf
from shapely.geometry import LineString

# first, drop organisms with only a single row from the gdf; a LineString needs at least two points.
atn_no_single_dets = atn_gdf['organismID'].value_counts() > 1
atn_filtered_gdf = atn_gdf[atn_gdf['organismID'].isin(atn_no_single_dets[atn_no_single_dets].index)]

# then make lines
atn_line_gdf = atn_filtered_gdf.sort_values(by=['organismID','eventDate']).groupby('organismID')['geometry'].apply(lambda x: LineString(x.tolist())).reset_index()

In [80]:
atn_line_gdf

| | organismID | geometry |
|---|---|---|
| 0 | 105838_great_white_shark | LINESTRING (-118.56000 34.03000, -166.18000 23... |


In [81]:
from lonboard import Map, PathLayer
layer = PathLayer.from_geopandas(atn_line_gdf, auto_downcast=False, width_min_pixels=1)
map_ = Map(layers=[layer])
map_

Map(layers=[PathLayer(table=pyarrow.Table
organismID: string
geometry: list<item: fixed_size_list<item: double…