# DwC Events. PSZMP data

FILL THIS OUT LATER USING THE HOOD CANAL TEXT AS A TEMPLATE.  
University of Washington Pelagic Hypoxia Hood Canal project, Zooplankton dataset.   
Alignment of dataset to Darwin Core (DwC) for NANOOS, https://www.nanoos.org. This data alignment work, including this Jupyter notebook, is described in the GitHub repository https://github.com/nanoos-pnw/obis-keisterhczoop. See [notebooks-notes.md](https://github.com/nanoos-pnw/obis-keisterhczoop/blob/main/notebooks-notes.md) and [README.md](https://github.com/nanoos-pnw/obis-keisterhczoop/blob/main/README.md).   

Emilio Mayorga, https://github.com/emiliom   

2/9, 1/31/2024

## Goals and scope of this notebook

Parse the source data to define and pull out 3 event "types": `cruise`, `stationVisit` and `sample`. The DwC event table is populated sequentially for each of those event types, in that order, from the most temporally aggregated (cruise) to the most granular (sample). Columns are populated differently depending on the event type. The notebook generates the DwC "event" file `DwC_event.csv`.

## Settings

In [1]:
from datetime import datetime
import json
from pathlib import Path

import numpy as np
import pandas as pd
import geopandas as gpd

from data_preprocess import read_and_parse_sourcedata

In [2]:
data_pth = Path(".")

Set to `True` when debugging. `csv` files will not be exported when `debug_no_csvexport = True`.

In [3]:
debug_no_csvexport = False

## Process JSON file containing common mappings and strings

In [4]:
with open(data_pth / 'common_mappings.json') as f:
    common_mappings = json.load(f)

In [5]:
DatasetCode = common_mappings['datasetcode']
cruises = common_mappings['cruises']
stations = common_mappings['stations']
net_tow = common_mappings['net_tow']

iso8601_format = common_mappings['iso8601_format']
CRS = common_mappings['CRS']
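
A quick check that the mappings file provides every key read above can catch a stale or incomplete `common_mappings.json` early. The sketch below is only an illustration: the helper name `missing_mapping_keys` and the sample dictionary are hypothetical, and only the key names are taken from this notebook.

```python
# Keys that this notebook reads from common_mappings.json
REQUIRED_KEYS = ['datasetcode', 'cruises', 'stations', 'net_tow', 'iso8601_format', 'CRS']


def missing_mapping_keys(mappings, required=REQUIRED_KEYS):
    """Return the required keys absent from the mappings dict (hypothetical helper)."""
    return [key for key in required if key not in mappings]


# Illustrative stand-in for the parsed JSON; the values are made up
example_mappings = {'datasetcode': 'EXAMPLE', 'CRS': 'EPSG:4326'}
missing = missing_mapping_keys(example_mappings)
print(missing)
```

An empty list means all expected keys are present.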

## Pre-process data from csv for Event table

### Read the pre-processed csv file

`usecols` defines the columns that will be kept and the order in which they'll be organized

In [6]:
# From the Hood Canal dataset
# usecols = [
#     'sample_code', 'mesh_size', 'FWC_DS', 
#     'station', 'latitude', 'longitude', 
#     'date', 'time_start', 'time', 'day_night', 
#     'depth_min', 'depth_max',
#     'net_code', 'extra_sample_token'
# ]

usecols = [
    'Sample Code', 
    'Station', 'Latitude', 'Longitude', 'Site Name', 'Basin',
    'Sample Date', 'Sample Time', 'Day_Night', 'time',
    'Min Tow Depth (m)', 'Max Tow Depth (m)', 'Station Depth (m)',
    'Mesh Size', 'Tow Type', 
]

# eventsource_df = read_and_parse_sourcedata(test_n_rows=1000)[usecols]
eventsource_df = read_and_parse_sourcedata()[usecols]

# TODO: Rename more columns, if needed
eventsource_df.rename(
    columns={
        'Sample Code':'sample_code',
        'Station':'station',
        'Latitude':'latitude',
        'Longitude':'longitude',
        'Min Tow Depth (m)':'depth_min', 
        'Max Tow Depth (m)':'depth_max', 
        'Mesh Size': 'mesh_size',
    },
    inplace=True
)

In [7]:
len(eventsource_df)

153825

In [8]:
eventsource_df.head()

Unnamed: 0,sample_code,station,latitude,longitude,Site Name,Basin,Sample Date,Sample Time,Day_Night,time,depth_min,depth_max,Station Depth (m),mesh_size,Tow Type
0,010218ELIV1151,ELIV,48.63795,-122.5694,Eliza Island,Bellingham Bay,2018-01-02 00:00:00,11:51:00,D,2018-01-02 11:51:00-07:00,0.0,110.0,120.7,200,Vertical
1,010218ELIV1151,ELIV,48.63795,-122.5694,Eliza Island,Bellingham Bay,2018-01-02 00:00:00,11:51:00,D,2018-01-02 11:51:00-07:00,0.0,110.0,120.7,200,Vertical
2,010218ELIV1151,ELIV,48.63795,-122.5694,Eliza Island,Bellingham Bay,2018-01-02 00:00:00,11:51:00,D,2018-01-02 11:51:00-07:00,0.0,110.0,120.7,200,Vertical
3,010218ELIV1151,ELIV,48.63795,-122.5694,Eliza Island,Bellingham Bay,2018-01-02 00:00:00,11:51:00,D,2018-01-02 11:51:00-07:00,0.0,110.0,120.7,200,Vertical
4,010218ELIV1151,ELIV,48.63795,-122.5694,Eliza Island,Bellingham Bay,2018-01-02 00:00:00,11:51:00,D,2018-01-02 11:51:00-07:00,0.0,110.0,120.7,200,Vertical


### Remove duplicates

Will return only unique samples, where one row = one sample.

In [9]:
eventsource_df = eventsource_df.drop_duplicates().sort_values(by='sample_code').reset_index(drop=True)

In [10]:
len(eventsource_df)

3567

### sample_code extra characters (extra sub-code -- see Amanda's explanation)

In [11]:
len(eventsource_df['sample_code'].unique())

3567

Note: Sample code lengths vary from 12 to 17 characters (see the counts below), with 14 characters the most common. Investigate what the extra characters encode.

In [12]:
eventsource_df['sample_code'].str.len().value_counts()

14    1384
17     922
15     860
16     397
12       2
13       2
Name: sample_code, dtype: int64
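
To investigate the varying lengths, it may help to split each code into its parts. The sample codes shown in this notebook look like a 6-digit date (MMDDYY), a station code and a 4-digit time (HHMM). The sketch below assumes that layout holds for every code, which has not been verified; `split_sample_code` is a hypothetical helper.

```python
import re

import pandas as pd

# Assumed layout: 6-digit date (MMDDYY), station code, 4-digit time (HHMM)
SAMPLE_CODE_RE = re.compile(r'^(?P<date>\d{6})(?P<station>.+?)(?P<hhmm>\d{4})$')


def split_sample_code(code):
    """Split a sample code into date, station and time parts (hypothetical helper)."""
    match = SAMPLE_CODE_RE.match(code)
    if match is None:
        # Codes that do not fit the assumed layout are flagged with empty parts
        return {'date': None, 'station': None, 'hhmm': None}
    return match.groupdict()


# Sample codes taken from outputs shown in this notebook
codes = pd.Series(['010218ELIV1151', '010322KSBP01D0815', '121222Cow3V21127'])
parts = pd.DataFrame([split_sample_code(c) for c in codes])
parts['code_length'] = codes.str.len()
print(parts)
```

Under this assumption, the length differences come entirely from the station part of the code.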

## TODO: REMOVED CRUISE EVENT HANDLING

**REFER TO THE HOOD CANAL DATASET NOTEBOOK TO ADAPT THE CRUISE CODE THAT AGGREGATES STATIONS, IF NEEDED**

Use 'Sampling Group' as the grouping attribute, analogous to cruises. Set up a dictionary in `common_mappings.json` matching up the 'Sampling Group' strings to fleshed out descriptions and anything else that might be relevant.

## Create empty Event dataframe

Records from each event type will be appended to this dataframe, by "type". The type is encoded in the `eventType` column, not in the DwC `type` column, which is not used here explicitly (the type is `Event`).

In [13]:
event_cols_dtypes = np.dtype(
    [
        ('eventID', str),
        ('eventType', str), 
        ('parentEventID', str),
        ('eventDate', str), 
        ('locationID', str),
        ('locality', str),
        ('decimalLatitude', float),
        ('decimalLongitude', float),
        ('footprintWKT', str),
        ('geodeticDatum', str),
        ('waterBody', str),
        ('countryCode', str), 
        ('minimumDepthInMeters', float),
        ('maximumDepthInMeters', float),
        ('samplingProtocol', str)
    ]
)

In [14]:
event_df = pd.DataFrame(np.empty(0, dtype=event_cols_dtypes))

## Create stationVisit events

- Use cruise `eventID` from `eventsource_df` as stationVisit `parentEventID`
- Add `stationvisit_code` to `eventsource_df`, for use by the next event type (sample)

In [15]:
pd.DataFrame(eventsource_df[eventsource_df['station'] == 'KSBP01D']['sample_code'].value_counts()).head(10)

Unnamed: 0,sample_code
010322KSBP01D0815,1
090721KSBP01D0919,1
082817KSBP01D1020,1
090319KSBP01D0918,1
090418KSBP01D1338,1
090517KSBP01D0930,1
090616KSBP01D0824,1
090622KSBP01D0939,1
090820KSBP01D0922,1
100118KSBP01D0908,1


In [16]:
# eventsource_df['Sample Code'].str[0:-4].head(10)

In [17]:
# eventsource_df['Day_Night'].head(10)

For now, build `stationvisit_code` from `sample_code` (dropping the last 4 characters, the HHMM time, to retain date+station) plus `Day_Night`.

In [18]:
eventsource_df['stationvisit_code'] = (
    eventsource_df['sample_code'].str[0:-4] + eventsource_df['Day_Night']
)

In [19]:
eventsource_df.head()

Unnamed: 0,sample_code,station,latitude,longitude,Site Name,Basin,Sample Date,Sample Time,Day_Night,time,depth_min,depth_max,Station Depth (m),mesh_size,Tow Type,stationvisit_code
0,010218ELIV1151,ELIV,48.63795,-122.5694,Eliza Island,Bellingham Bay,2018-01-02 00:00:00,11:51:00,D,2018-01-02 11:51:00-07:00,0.0,110.0,120.7,200,Vertical,010218ELIVD
1,010322KSBP01D0815,KSBP01D,47.74396,-122.4282,Point Jefferson,Central Basin,2022-01-03 00:00:00,08:15:00,D,2022-01-03 08:15:00-07:00,22.0,0.0,275.0,335,Oblique,010322KSBP01DD
2,010422LSNT01D1323,LSNT01D,47.53333,-122.4333,Point Williams,Central Basin,2022-01-04 00:00:00,13:23:00,D,2022-01-04 13:23:00-07:00,38.0,0.0,210.0,335,Oblique,010422LSNT01DD
3,010422LSNT01V1305,LSNT01V,47.53333,-122.4333,Point Williams,Central Basin,2022-01-04 00:00:00,13:05:00,D,2022-01-04 13:05:00-07:00,0.0,200.0,210.0,200,Vertical,010422LSNT01VD
4,010422NSEX01V1049,NSEX01V,47.35862,-122.3871,East Passage,Central Basin,2022-01-04 00:00:00,10:49:00,D,2022-01-04 10:49:00-07:00,0.0,170.0,180.0,200,Vertical,010422NSEX01VD


In [20]:
stationvisit_df = eventsource_df.groupby(
    ['Sample Date', 'Day_Night', 'station', 'latitude', 'longitude', 'stationvisit_code', 'Basin', 'Site Name']
).agg({
    'time':['min', 'max'],
    'depth_min':['min'],
    'depth_max':['max'],
})
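# Flatten the two-level column index, e.g. ('time', 'min') -> 'time_min'.
# Iterating over the MultiIndex yields tuples, so the deprecated .ravel() is not needed.
stationvisit_df.columns = ["_".join(stat) for stat in stationvisit_df.columns]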
stationvisit_df = (
    stationvisit_df
    .sort_values(by='time_min')
    .reset_index(drop=False)
)

len(stationvisit_df)


3551
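
As an alternative to joining the two column levels by hand, pandas named aggregation produces flat column names directly. The sketch below uses a made-up toy frame with only one grouping key; the output column names match those used in this notebook.

```python
import pandas as pd

# Toy data (made up) with the same column names as eventsource_df
toy = pd.DataFrame({
    'stationvisit_code': ['A', 'A', 'B'],
    'time': pd.to_datetime(['2014-03-25 10:35', '2014-03-25 11:08', '2014-04-01 10:01']),
    'depth_min': [0.0, 15.0, 0.0],
    'depth_max': [40.0, 0.0, 122.0],
})

# Named aggregation: output_name=(input_column, aggregation)
flat = (
    toy.groupby('stationvisit_code')
    .agg(
        time_min=('time', 'min'),
        time_max=('time', 'max'),
        depth_min_min=('depth_min', 'min'),
        depth_max_max=('depth_max', 'max'),
    )
    .sort_values(by='time_min')
    .reset_index(drop=False)
)
print(list(flat.columns))
```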

**TODO:**
- HMM, the record count above is 3551, 16 lower than the 3567 unique samples in `eventsource_df`
- Below, some records have `depth_min_min` > `depth_max_max`! That's nonsensical. It may be due to inconsistent data entry patterns. I may need to take a different approach: calculate `depth_min_min` as the min and `depth_max_max` as the max of both `depth_min` and `depth_max`
- In the cell below, will have to come up with fake "cruise" data to merge into `stationvisit_df`, until I create a station visit grouping analogous to cruises
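
The alternative depth approach described in the list above could be sketched as follows. The toy values mirror the first rows of the output below; the column names `depth_shallow` and `depth_deep` are hypothetical. Whether a reversed pair reflects tow direction (the reversed rows shown in this notebook are `Oblique` tows) or a data entry error is an open question that this sketch does not answer.

```python
import pandas as pd

# Toy rows mirroring depth values shown in this notebook
toy = pd.DataFrame({
    'depth_min': [0.0, 15.0, 20.0],
    'depth_max': [40.0, 0.0, 0.0],
})

depth_cols = ['depth_min', 'depth_max']
# Row-wise min and max across both columns, regardless of which one holds the larger value
toy['depth_shallow'] = toy[depth_cols].min(axis=1)
toy['depth_deep'] = toy[depth_cols].max(axis=1)
print(toy)
```

After this step the shallow value is never greater than the deep value, so a later group-wise `min` and `max` would stay consistent.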

In [21]:
stationvisit_df.head(10)

Unnamed: 0,Sample Date,Day_Night,station,latitude,longitude,stationvisit_code,Basin,Site Name,time_min,time_max,depth_min_min,depth_max_max
0,2014-03-25 00:00:00,D,DANAV,47.18327,-122.8307,032514DANAVD,South Sound,Dana Passage,2014-03-25 10:35:00-07:00,2014-03-25 10:35:00-07:00,0.0,40.0
1,2014-03-25 00:00:00,D,DANAS,47.17593,-122.8355,032514DANASD,South Sound,Dana Passage,2014-03-25 11:08:00-07:00,2014-03-25 11:08:00-07:00,15.0,0.0
2,2014-03-25 00:00:00,D,DANAM,47.17678,-122.8352,032514DANAMD,South Sound,Dana Passage,2014-03-25 11:26:00-07:00,2014-03-25 11:26:00-07:00,20.0,0.0
3,2014-03-25 00:00:00,D,DANAD,47.17997,-122.8318,032514DANADD,South Sound,Dana Passage,2014-03-25 11:47:00-07:00,2014-03-25 11:47:00-07:00,30.0,0.0
4,2014-04-01 00:00:00,D,SKETV,47.15243,-122.6586,040114sketvD,South Sound,South Ketron/Solo Point,2014-04-01 10:01:00-07:00,2014-04-01 10:01:00-07:00,0.0,122.0
5,2014-04-01 00:00:00,D,SKETS,47.13843,-122.6354,040114SKETSD,South Sound,South Ketron/Solo Point,2014-04-01 10:55:00-07:00,2014-04-01 10:55:00-07:00,10.0,0.0
6,2014-04-01 00:00:00,D,SKETM,47.13892,-122.6367,040114sketmD,South Sound,South Ketron/Solo Point,2014-04-01 11:20:00-07:00,2014-04-01 11:20:00-07:00,20.0,0.0
7,2014-04-01 00:00:00,D,SKETD,47.14032,-122.6377,040114sketdD,South Sound,South Ketron/Solo Point,2014-04-01 11:50:00-07:00,2014-04-01 11:50:00-07:00,30.0,0.0
8,2014-04-04 00:00:00,D,ADIV,48.00273,-122.636,040414ADIVD,Admiralty Inlet,Admiralty Inlet,2014-04-04 13:00:00-07:00,2014-04-04 13:00:00-07:00,0.0,100.0
9,2014-04-09 00:00:00,D,KSBP01V,47.74396,-122.4282,040914ksbp01D,Central Basin,Point Jefferson,2014-04-09 09:25:00-07:00,2014-04-09 09:25:00-07:00,0.0,200.0


In [22]:
# stationvisit_df = stationvisit_df.merge(
#     cruise_df[['date_yyyymm', 'eventID', 'waterBody', 'countryCode']],
#     how='left', 
#     on='date_yyyymm'
# )

# TEMPORARY HARD-WIRING
stationvisit_df['eventID'] = 'LUMPED'
stationvisit_df['waterBody'] = stationvisit_df['Basin'] + ', Puget Sound'
stationvisit_df['countryCode'] = 'US'

In [23]:
stationvisit_df.head()

Unnamed: 0,Sample Date,Day_Night,station,latitude,longitude,stationvisit_code,Basin,Site Name,time_min,time_max,depth_min_min,depth_max_max,eventID,waterBody,countryCode
0,2014-03-25 00:00:00,D,DANAV,47.18327,-122.8307,032514DANAVD,South Sound,Dana Passage,2014-03-25 10:35:00-07:00,2014-03-25 10:35:00-07:00,0.0,40.0,LUMPED,"South Sound, Puget Sound",US
1,2014-03-25 00:00:00,D,DANAS,47.17593,-122.8355,032514DANASD,South Sound,Dana Passage,2014-03-25 11:08:00-07:00,2014-03-25 11:08:00-07:00,15.0,0.0,LUMPED,"South Sound, Puget Sound",US
2,2014-03-25 00:00:00,D,DANAM,47.17678,-122.8352,032514DANAMD,South Sound,Dana Passage,2014-03-25 11:26:00-07:00,2014-03-25 11:26:00-07:00,20.0,0.0,LUMPED,"South Sound, Puget Sound",US
3,2014-03-25 00:00:00,D,DANAD,47.17997,-122.8318,032514DANADD,South Sound,Dana Passage,2014-03-25 11:47:00-07:00,2014-03-25 11:47:00-07:00,30.0,0.0,LUMPED,"South Sound, Puget Sound",US
4,2014-04-01 00:00:00,D,SKETV,47.15243,-122.6586,040114sketvD,South Sound,South Ketron/Solo Point,2014-04-01 10:01:00-07:00,2014-04-01 10:01:00-07:00,0.0,122.0,LUMPED,"South Sound, Puget Sound",US


In [24]:
stationvisit_df.rename(
    columns={
        'station':'locationID',
        'latitude':'decimalLatitude',
        'longitude':'decimalLongitude',
        'eventID':'parentEventID',
        'depth_min_min':'minimumDepthInMeters', 
        'depth_max_max':'maximumDepthInMeters',
    },
    inplace=True
)

In [25]:
# This form is for populating eventDate with an iso8601 interval
# stationvisit_df['eventDate'] = stationvisit_df[['time_min', 'time_max']].apply(
#     lambda row: "{}/{}".format(row['time_min'].strftime(iso8601_format), 
#                                row['time_max'].strftime(iso8601_format)),
#     axis=1
# )

stationvisit_df['eventDate'] = stationvisit_df.apply(
    lambda row: "{}".format(row['time_min'].strftime(iso8601_format)),
    axis=1
)
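
For reference, the two `eventDate` forms (single timestamp and ISO 8601 interval) can be compared on a standalone example. The format string below is an assumption inferred from the `eventDate` values shown in this notebook; the real value is read from `common_mappings.json`.

```python
from datetime import datetime, timedelta, timezone

# Assumed value; the notebook reads the real one from common_mappings.json
assumed_iso8601_format = '%Y-%m-%dT%H:%M:%S%z'

tz = timezone(timedelta(hours=-7))
time_min = datetime(2014, 3, 25, 10, 35, tzinfo=tz)
time_max = datetime(2014, 3, 25, 11, 47, tzinfo=tz)

# Single timestamp, as currently used for stationVisit events
single = time_min.strftime(assumed_iso8601_format)
# Start/end interval, as in the commented-out form
interval = (
    f"{time_min.strftime(assumed_iso8601_format)}/"
    f"{time_max.strftime(assumed_iso8601_format)}"
)
print(single)
print(interval)
```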

In [26]:
stationvisit_df['eventID'] = stationvisit_df['parentEventID'] + '-' + stationvisit_df['stationvisit_code']
stationvisit_df['eventType'] = 'stationVisit'
# stationvisit_df['locality'] = stationvisit_df['locationID'].apply(lambda cd: stations[cd])
stationvisit_df['locality'] = stationvisit_df['Site Name']
stationvisit_df['geodeticDatum'] = CRS

Verify that no duplicate station `eventID` values are created

In [27]:
len(stationvisit_df.eventID.unique()) == len(stationvisit_df)

True
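
If this check ever returns `False`, listing the offending rows is a natural next step. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy frame (made up) with one repeated eventID
toy = pd.DataFrame({
    'eventID': ['LUMPED-A', 'LUMPED-B', 'LUMPED-A'],
    'locationID': ['S1', 'S2', 'S1'],
})

is_unique = toy['eventID'].is_unique
# keep=False marks every occurrence of a repeated value, not just the later ones
dupes = toy[toy['eventID'].duplicated(keep=False)].sort_values(by='eventID')
print(is_unique)
print(dupes)
```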

In [28]:
stationvisit_df.head(5)

Unnamed: 0,Sample Date,Day_Night,locationID,decimalLatitude,decimalLongitude,stationvisit_code,Basin,Site Name,time_min,time_max,minimumDepthInMeters,maximumDepthInMeters,parentEventID,waterBody,countryCode,eventDate,eventID,eventType,locality,geodeticDatum
0,2014-03-25 00:00:00,D,DANAV,47.18327,-122.8307,032514DANAVD,South Sound,Dana Passage,2014-03-25 10:35:00-07:00,2014-03-25 10:35:00-07:00,0.0,40.0,LUMPED,"South Sound, Puget Sound",US,2014-03-25T10:35:00-0700,LUMPED-032514DANAVD,stationVisit,Dana Passage,EPSG:4326
1,2014-03-25 00:00:00,D,DANAS,47.17593,-122.8355,032514DANASD,South Sound,Dana Passage,2014-03-25 11:08:00-07:00,2014-03-25 11:08:00-07:00,15.0,0.0,LUMPED,"South Sound, Puget Sound",US,2014-03-25T11:08:00-0700,LUMPED-032514DANASD,stationVisit,Dana Passage,EPSG:4326
2,2014-03-25 00:00:00,D,DANAM,47.17678,-122.8352,032514DANAMD,South Sound,Dana Passage,2014-03-25 11:26:00-07:00,2014-03-25 11:26:00-07:00,20.0,0.0,LUMPED,"South Sound, Puget Sound",US,2014-03-25T11:26:00-0700,LUMPED-032514DANAMD,stationVisit,Dana Passage,EPSG:4326
3,2014-03-25 00:00:00,D,DANAD,47.17997,-122.8318,032514DANADD,South Sound,Dana Passage,2014-03-25 11:47:00-07:00,2014-03-25 11:47:00-07:00,30.0,0.0,LUMPED,"South Sound, Puget Sound",US,2014-03-25T11:47:00-0700,LUMPED-032514DANADD,stationVisit,Dana Passage,EPSG:4326
4,2014-04-01 00:00:00,D,SKETV,47.15243,-122.6586,040114sketvD,South Sound,South Ketron/Solo Point,2014-04-01 10:01:00-07:00,2014-04-01 10:01:00-07:00,0.0,122.0,LUMPED,"South Sound, Puget Sound",US,2014-04-01T10:01:00-0700,LUMPED-040114sketvD,stationVisit,South Ketron/Solo Point,EPSG:4326


### Populate (append to) the `event_df` table with the stationVisit events

In [29]:
event_df = pd.concat(
    [
        event_df,
        stationvisit_df[[
            'eventID', 'eventType', 'parentEventID', 'eventDate', 
            'decimalLatitude', 'decimalLongitude', 'geodeticDatum',
            'locationID', 'locality', 'waterBody', 'countryCode', 
            'minimumDepthInMeters', 'maximumDepthInMeters'
        ]]
    ],
    ignore_index=True
)

len(event_df)

3551

In [30]:
event_df.head(8)

Unnamed: 0,eventID,eventType,parentEventID,eventDate,locationID,locality,decimalLatitude,decimalLongitude,footprintWKT,geodeticDatum,waterBody,countryCode,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol
0,LUMPED-032514DANAVD,stationVisit,LUMPED,2014-03-25T10:35:00-0700,DANAV,Dana Passage,47.18327,-122.8307,,EPSG:4326,"South Sound, Puget Sound",US,0.0,40.0,
1,LUMPED-032514DANASD,stationVisit,LUMPED,2014-03-25T11:08:00-0700,DANAS,Dana Passage,47.17593,-122.8355,,EPSG:4326,"South Sound, Puget Sound",US,15.0,0.0,
2,LUMPED-032514DANAMD,stationVisit,LUMPED,2014-03-25T11:26:00-0700,DANAM,Dana Passage,47.17678,-122.8352,,EPSG:4326,"South Sound, Puget Sound",US,20.0,0.0,
3,LUMPED-032514DANADD,stationVisit,LUMPED,2014-03-25T11:47:00-0700,DANAD,Dana Passage,47.17997,-122.8318,,EPSG:4326,"South Sound, Puget Sound",US,30.0,0.0,
4,LUMPED-040114sketvD,stationVisit,LUMPED,2014-04-01T10:01:00-0700,SKETV,South Ketron/Solo Point,47.15243,-122.6586,,EPSG:4326,"South Sound, Puget Sound",US,0.0,122.0,
5,LUMPED-040114SKETSD,stationVisit,LUMPED,2014-04-01T10:55:00-0700,SKETS,South Ketron/Solo Point,47.13843,-122.6354,,EPSG:4326,"South Sound, Puget Sound",US,10.0,0.0,
6,LUMPED-040114sketmD,stationVisit,LUMPED,2014-04-01T11:20:00-0700,SKETM,South Ketron/Solo Point,47.13892,-122.6367,,EPSG:4326,"South Sound, Puget Sound",US,20.0,0.0,
7,LUMPED-040114sketdD,stationVisit,LUMPED,2014-04-01T11:50:00-0700,SKETD,South Ketron/Solo Point,47.14032,-122.6377,,EPSG:4326,"South Sound, Puget Sound",US,30.0,0.0,


**EXIT HERE, UNTIL I DEVELOP THE NEXT SECTION**

In [31]:
# raise UserWarning('Exit Early')

## Create individual "sample" events

- Each unique `sample_code` will be an event. `sample_code` will be the eventID, possibly prefixed by the dataset code, `UWPHHCZoop`

In [32]:
sample_df = eventsource_df.copy()
sample_df.head()

Unnamed: 0,sample_code,station,latitude,longitude,Site Name,Basin,Sample Date,Sample Time,Day_Night,time,depth_min,depth_max,Station Depth (m),mesh_size,Tow Type,stationvisit_code
0,010218ELIV1151,ELIV,48.63795,-122.5694,Eliza Island,Bellingham Bay,2018-01-02 00:00:00,11:51:00,D,2018-01-02 11:51:00-07:00,0.0,110.0,120.7,200,Vertical,010218ELIVD
1,010322KSBP01D0815,KSBP01D,47.74396,-122.4282,Point Jefferson,Central Basin,2022-01-03 00:00:00,08:15:00,D,2022-01-03 08:15:00-07:00,22.0,0.0,275.0,335,Oblique,010322KSBP01DD
2,010422LSNT01D1323,LSNT01D,47.53333,-122.4333,Point Williams,Central Basin,2022-01-04 00:00:00,13:23:00,D,2022-01-04 13:23:00-07:00,38.0,0.0,210.0,335,Oblique,010422LSNT01DD
3,010422LSNT01V1305,LSNT01V,47.53333,-122.4333,Point Williams,Central Basin,2022-01-04 00:00:00,13:05:00,D,2022-01-04 13:05:00-07:00,0.0,200.0,210.0,200,Vertical,010422LSNT01VD
4,010422NSEX01V1049,NSEX01V,47.35862,-122.3871,East Passage,Central Basin,2022-01-04 00:00:00,10:49:00,D,2022-01-04 10:49:00-07:00,0.0,170.0,180.0,200,Vertical,010422NSEX01VD


In [33]:
sample_df = sample_df.merge(
    stationvisit_df[['stationvisit_code', 'eventID', 'waterBody', 'countryCode', 
                     'locationID', 'locality', 'geodeticDatum']],
    how='left', 
    on='stationvisit_code'
)

In [34]:
sample_df = (
    sample_df
    .rename(columns={
        'sample_code':'eventID',
        'eventID':'parentEventID',
        'latitude':'decimalLatitude',
        'longitude':'decimalLongitude',
        'depth_min':'minimumDepthInMeters',
        'depth_max':'maximumDepthInMeters',
    })
    .sort_values(by='time')
    .reset_index(drop=False)
)

In [35]:
# NOTE: Carried over from the Hood Canal notebook. It relies on the 'FWC_DS' and
# 'net_code' columns, which are not among the columns read for this dataset,
# so it is not called until it is adapted.
def samplingProtocol(row):
    return (
        f"{net_tow[row['FWC_DS']]} net tow using 0.25 m2 HydroBios MultiNet Multiple Plankton Sampler,"
        f" net code {row['net_code']}, {row['mesh_size']} micron mesh"
    )

sample_df['eventType'] = 'sample'
sample_df['eventDate'] = sample_df['time'].apply(lambda t: t.strftime(iso8601_format))
# FOR NOW, USE A PLACEHOLDER VALUE
sample_df['samplingProtocol'] = "PLACE HOLDER"
# sample_df['samplingProtocol'] = sample_df.apply(samplingProtocol, axis=1)

In [36]:
sample_df.head()

Unnamed: 0,index,eventID,station,decimalLatitude,decimalLongitude,Site Name,Basin,Sample Date,Sample Time,Day_Night,...,stationvisit_code,parentEventID,waterBody,countryCode,locationID,locality,geodeticDatum,eventType,eventDate,samplingProtocol
0,523,032514DANAV1035,DANAV,47.18327,-122.8307,Dana Passage,South Sound,2014-03-25 00:00:00,10:35:00,D,...,032514DANAVD,LUMPED-032514DANAVD,"South Sound, Puget Sound",US,DANAV,Dana Passage,EPSG:4326,sample,2014-03-25T10:35:00-0700,PLACE HOLDER
1,522,032514DANAS1108,DANAS,47.17593,-122.8355,Dana Passage,South Sound,2014-03-25 00:00:00,11:08:00,D,...,032514DANASD,LUMPED-032514DANASD,"South Sound, Puget Sound",US,DANAS,Dana Passage,EPSG:4326,sample,2014-03-25T11:08:00-0700,PLACE HOLDER
2,521,032514DANAM1126,DANAM,47.17678,-122.8352,Dana Passage,South Sound,2014-03-25 00:00:00,11:26:00,D,...,032514DANAMD,LUMPED-032514DANAMD,"South Sound, Puget Sound",US,DANAM,Dana Passage,EPSG:4326,sample,2014-03-25T11:26:00-0700,PLACE HOLDER
3,520,032514DANAD1147,DANAD,47.17997,-122.8318,Dana Passage,South Sound,2014-03-25 00:00:00,11:47:00,D,...,032514DANADD,LUMPED-032514DANADD,"South Sound, Puget Sound",US,DANAD,Dana Passage,EPSG:4326,sample,2014-03-25T11:47:00-0700,PLACE HOLDER
4,578,040114sketv1001,SKETV,47.15243,-122.6586,South Ketron/Solo Point,South Sound,2014-04-01 00:00:00,10:01:00,D,...,040114sketvD,LUMPED-040114sketvD,"South Sound, Puget Sound",US,SKETV,South Ketron/Solo Point,EPSG:4326,sample,2014-04-01T10:01:00-0700,PLACE HOLDER


### Populate (append to) the `event_df` table with the sample events

In [37]:
event_df = pd.concat(
    [
        event_df,
        sample_df[[
            'eventID', 'eventType', 'parentEventID', 'eventDate', 
            'locationID', 'locality', 'waterBody', 'countryCode', 
            'decimalLatitude', 'decimalLongitude', 'geodeticDatum',
            'minimumDepthInMeters', 'maximumDepthInMeters', 'samplingProtocol'
        ]]
    ],
    ignore_index=True
)

len(event_df)

7118

In [38]:
event_df.tail(10)

Unnamed: 0,eventID,eventType,parentEventID,eventDate,locationID,locality,decimalLatitude,decimalLongitude,footprintWKT,geodeticDatum,waterBody,countryCode,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol
7108,120622NSEX01V1236,sample,,2022-12-06T12:36:00-0700,,,47.35862,-122.3871,,,,,0.0,170.0,PLACE HOLDER
7109,121222Cow3V21127,sample,,2022-12-12T11:27:00-0700,,,48.67437,-123.0481,,,,,,0.0,PLACE HOLDER
7110,121222MUKV1142,sample,,2022-12-12T11:42:00-0700,,,47.97166,-122.3222,,,,,0.0,200.0,PLACE HOLDER
7111,121222CAMV1231,sample,,2022-12-12T12:31:00-0700,,,48.05901,-122.3873,,,,,0.0,175.0,PLACE HOLDER
7112,121222Wat1V1344,sample,,2022-12-12T13:44:00-0700,,,48.43457,-122.8037,,,,,0.0,30.0,PLACE HOLDER
7113,121422HCB003V1029,sample,,2022-12-14T10:29:00-0700,,,47.53787,-123.0096,,,,,0.0,139.0,PLACE HOLDER
7114,121422HCB004V1137,sample,,2022-12-14T11:37:00-0700,,,47.3562,-123.0249,,,,,0.0,48.0,PLACE HOLDER
7115,121522SARAV1137,sample,LUMPED-121522SARAVD,2022-12-15T11:37:00-0700,SARAV,North Saratoga Passage,48.25673,-122.5442,,EPSG:4326,"Whidbey Basin, Puget Sound",US,0.0,85.0,PLACE HOLDER
7116,121922ADIV1027,sample,,2022-12-19T10:27:00-0700,,,48.00273,-122.636,,,,,0.0,0.0,PLACE HOLDER
7117,121922TDBV1158,sample,LUMPED-121922TDBVD,2022-12-19T11:58:00-0700,TDBV,Thorndyke Bay,47.78297,-122.733,,EPSG:4326,"Hood Canal, Puget Sound",US,0.0,115.0,PLACE HOLDER
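
Several sample rows above have an empty `parentEventID`, meaning the left merge found no matching stationVisit. One possible cause (an assumption, not verified against the source data) is that `groupby` drops rows with a missing value in any grouping key by default, so those samples never produce a stationVisit record. The sketch below reproduces that situation on made-up data and uses the `indicator` option of `merge` to list the unmatched samples.

```python
import numpy as np
import pandas as pd

# Toy samples (made up); one has a missing grouping attribute
samples = pd.DataFrame({
    'sample_code': ['s1', 's2', 's3'],
    'stationvisit_code': ['v1', 'v2', 'v3'],
    'Site Name': ['Site A', np.nan, 'Site C'],
})

# groupby drops rows whose grouping key is NaN (dropna=True is the default)
visits = (
    samples.groupby(['stationvisit_code', 'Site Name'])
    .size()
    .reset_index(name='n_samples')
)

# indicator=True adds a '_merge' column marking rows without a match as 'left_only'
checked = samples.merge(
    visits[['stationvisit_code']],
    how='left',
    on='stationvisit_code',
    indicator=True,
)
orphans = checked[checked['_merge'] == 'left_only']
print(orphans['sample_code'].tolist())
```

If missing grouping keys turn out to be the cause, passing `dropna=False` to `groupby` would keep those rows.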


## Export `event_df` to csv

In [39]:
event_df.eventType.value_counts()

sample          3567
stationVisit    3551
Name: eventType, dtype: int64

In [40]:
if not debug_no_csvexport:
    event_df.to_csv(data_pth / 'aligned_csvs' / 'DwC_event.csv', index=False)

## Package versions

In [41]:
print(
    f"{datetime.utcnow()} +00:00\n"
    f"pandas: {pd.__version__}, geopandas: {gpd.__version__}"
)

2024-02-09 23:17:07.970466 +00:00
pandas: 1.5.3, geopandas: 0.12.2
