# Sound production to presence

This notebook uses the CoastWatch ERDDAP server to collect species presence information from SanctSound sound production datasets.

It creates a `sanctsound_presence.zip` file that records when and where animals were acoustically present.

The notebook `presence_to_occurrence.ipynb` reads the results of this notebook and converts them to an occurrence table. 
The notebook `sound_propagation_processing.ipynb` reads the occurrence table and adds information about the coordinateUncertainty from sound propagation modeling data.


Let's search the [Coastwatch ERDDAP](https://coastwatch.pfeg.noaa.gov/erddap/index.html) for datasets that contain the following information:

```
sanctsound "Sound Production"
```

In [1]:
import erddapy

erddapy.__version__

'2.2.0'

In [2]:
from erddapy import ERDDAP
import pandas as pd

server = "https://coastwatch.pfeg.noaa.gov/erddap/"

protocol = "griddap"

search_for = 'sanctsound "Sound Production"'

e = ERDDAP(server=server, protocol=protocol)

url = e.get_search_url(search_for=search_for, response="csv")

datasets = pd.read_csv(url)[["Dataset ID","Title"]]

datasets

Unnamed: 0,Dataset ID,Title
0,noaaSanctSound_GR01_01_dolphins_1h,NOAA-Navy Sanctuary Soundscape Monitoring Proj...
1,noaaSanctSound_GR01_02_dolphins_1h,NOAA-Navy Sanctuary Soundscape Monitoring Proj...
2,noaaSanctSound_GR01_03_dolphins_1h,NOAA-Navy Sanctuary Soundscape Monitoring Proj...
3,noaaSanctSound_GR01_04_dolphins_1h,NOAA-Navy Sanctuary Soundscape Monitoring Proj...
4,noaaSanctSound_GR01_05_dolphins_1h,NOAA-Navy Sanctuary Soundscape Monitoring Proj...
...,...,...
690,noaaSanctSound_SB03_08_finwhale_1d,NOAA-Navy Sanctuary Soundscape Monitoring Proj...
691,noaaSanctSound_SB03_09_finwhale_1d,NOAA-Navy Sanctuary Soundscape Monitoring Proj...
692,noaaSanctSound_SB03_10_finwhale_1d,NOAA-Navy Sanctuary Soundscape Monitoring Proj...
693,noaaSanctSound_SB03_11_finwhale_1d,NOAA-Navy Sanctuary Soundscape Monitoring Proj...


# Next let's start building our presence table from each dataset

Now let's find the `start_time` and `end_time` when animals were present (e.g. `dolphin_presence == 1`).

To do this, we look through the variables in each dataset to find the data variable whose name ends with `presence` or `detection_count` (e.g. `dolphin_presence`). We then keep only the entries where that variable is non-zero (1.0 for presence variables) and drop all other entries.

This returns a filtered xarray `DataArray` of only presence values, along with its time coordinate (`start_time` or `time`) and the associated metadata.
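The variable lookup described above can be sketched on its own, without touching ERDDAP. This is a minimal illustration, not the notebook's code: `find_presence_var` and `PRESENCE_SUFFIXES` are hypothetical names, and `sound_pressure_level` is a made-up variable used only to show the no-match case. As in the loop below, the first matching variable wins.

```python
# Suffixes that mark a presence-style variable (taken from the text above).
PRESENCE_SUFFIXES = ("presence", "detection_count")


def find_presence_var(var_names):
    """Return the first variable name ending in a presence suffix, or None."""
    for name in var_names:
        # str.endswith accepts a tuple and matches any of its entries.
        if name.endswith(PRESENCE_SUFFIXES):
            return name
    return None


print(find_presence_var(["start_time", "dolphin_presence"]))
print(find_presence_var(["sound_pressure_level"]))  # hypothetical name, no match
```

Returning `None` instead of raising makes the "skip this dataset" branch an explicit check rather than an exception handler.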

In [3]:
%%time

df_final = pd.DataFrame()

df_broken = pd.DataFrame()

for index, row in datasets.iterrows():
    
    e.dataset_id = row['Dataset ID']
    
    ds = e.to_xarray()
    
    time_var = list(ds.coords)[0]
    
    # Pick the first presence/detection variable; skip datasets without one.
    try:
        da = [da for varname, da in ds.data_vars.items() if varname.endswith(("presence", "detection_count"))][0]
    except IndexError:
        
        string = 'Skipping {} - no presence vars'.format(row['Dataset ID'])
        
        df_broke = pd.DataFrame({'Dataset ID': [row['Dataset ID']], 
                                'reason': [string]})
        df_broken = pd.concat([df_broken, df_broke])
        
        continue

    
    # subset to non-zero entries only (e.g. presence var == 1)
    ds_subset = da[da.values != 0]
    
    # kick out if ds_subset is empty
    if len(ds_subset[time_var]) == 0:
        
        string = '{} obs - moving to next dataset'.format(len(ds_subset[time_var]))
        
        df_broke = pd.DataFrame({'Dataset ID': [row['Dataset ID']], 
                                'reason': [string]})
        
        df_broken = pd.concat([df_broken, df_broke])
        
        continue
        
    df = ds_subset.to_dataframe().reset_index()
    
    # print some details about what is happening.
    if index==1:
        print('\n{}/{} datasets\n'.format(index+1,datasets.shape[0]))
        print('querying {}'.format(row['Dataset ID']))
        print('{} ({} rows): {}'.format(row['Dataset ID'],ds.coords.dims[time_var],ds.geospatial_bounds))
        print('Subsetted to ({} rows)'.format(len(ds_subset[time_var])))
    
    # Store global attributes as data.
    df['dataset_id'] = row['Dataset ID']
    df['WKT'] = ds.geospatial_bounds
    df['decimalLatitude'] = ds.geospatial_bounds.split(" ")[1].replace("(","")
    df['decimalLongitude'] = ds.geospatial_bounds.split(" ")[2].replace(")","")
    df['vernacularName'] = ds.title.split(",")[1].replace(" Sound Production","").replace(" Sound Producion","").lower().lstrip()
    
    # Create the BIG dataframe
    df_final = pd.concat([df_final, df])

print(f"Found {df_final.shape[0]} presence records. There were {df_broken.shape[0]} datasets that either didn't have a presence variable or didn't contain presence data")


2/695 datasets

querying noaaSanctSound_GR01_02_dolphins_1h
noaaSanctSound_GR01_02_dolphins_1h (3029 rows): POINT (31.396417 -80.8904)
Subsetted to (56 rows)
Found 713676 presence records. There were 186 datasets that either didn't have a presence variable or didn't contain presence data
CPU times: total: 40.7 s
Wall time: 5min 20s
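
The loop above pulls `decimalLatitude` and `decimalLongitude` out of `geospatial_bounds` by splitting on spaces, which leaves them as strings. A minimal sketch of a stricter alternative, assuming every dataset reports a `POINT (a b)` string like the one printed above: `parse_point_wkt` is a hypothetical helper, not part of the notebook or of any library.

```python
import re

# Matches strings such as "POINT (31.396417 -80.8904)".
_POINT_RE = re.compile(r"\s*POINT\s*\(\s*([-+.\deE]+)\s+([-+.\deE]+)\s*\)\s*")


def parse_point_wkt(wkt):
    """Return the two numbers inside a 'POINT (a b)' string as floats."""
    match = _POINT_RE.fullmatch(wkt)
    if match is None:
        raise ValueError(f"not a POINT string: {wkt!r}")
    return float(match.group(1)), float(match.group(2))


# The loop above treats the first number as latitude and the second as longitude.
lat, lon = parse_point_wkt("POINT (31.396417 -80.8904)")
print(lat, lon)
```

Failing loudly on anything that is not a `POINT` would surface unexpected geometries instead of silently writing a malformed coordinate.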


In [4]:
df_final.sample(n=5)

Unnamed: 0,start_time,dolphin_presence,dataset_id,WKT,decimalLatitude,decimalLongitude,vernacularName,time,bluewhale_presence,bluewhale_manual_presence,...,pinniped_presence,redgrouper_detection_count,seiwhale_presence,atlanticcod_presence,blackgrouper_detection_count,humpbackwhale_presence,killerwhale_presence,minkewhale_presence,plainfinmidshipman_presence,northatlanticrightwhale_presence
40112,NaT,,noaaSanctSound_MB01_03_bluewhale,POINT (36.798 -121.976),36.798,-121.976,blue whale,2019-12-08 15:22:29.368,1.0,,...,,,,,,,,,,
40783,NaT,,noaaSanctSound_MB01_06_bluewhale,POINT (36.7977 -121.9757),36.7977,-121.9757,blue whale,2020-12-06 23:15:11.864,1.0,,...,,,,,,,,,,
22773,NaT,,noaaSanctSound_CI02_05_bluewhale,POINT (34.0853 -120.5223),34.0853,-120.5223,blue whale,2020-09-12 14:51:11.608,1.0,,...,,,,,,,,,,
26017,NaT,,noaaSanctSound_CI04_02_bluewhale,POINT (33.8489 -120.1175),33.8489,-120.1175,blue whale,2019-08-03 03:32:30.504,1.0,,...,,,,,,,,,,
17670,NaT,,noaaSanctSound_MB01_06_bluewhale,POINT (36.7977 -121.9757),36.7977,-121.9757,blue whale,2020-10-19 03:01:40.848,1.0,,...,,,,,,,,,,


## WoRMS Mapping
Abby Benson created a WoRMS mapping table, which we will use below to insert the appropriate WoRMS identifiers.

In [5]:
df_mapping = pd.read_csv('SanctSound_SpeciesLookupTable.csv')

df_mapping

Unnamed: 0,vernacularName,scientificName,scientificNameID,taxonRank,kingdom,propagationFrequency
0,dolphin,Cetacea,urn:lsid:marinespecies.org:taxname:2688,Infraorder,Animalia,5000
1,blue whale,Balaenoptera musculus,urn:lsid:marinespecies.org:taxname:137090,Species,Animalia,63
2,bocaccio,Sebastes paucispinis,urn:lsid:marinespecies.org:taxname:274833,Species,Animalia,300
3,fin whale,Balaenoptera physalus,urn:lsid:marinespecies.org:taxname:137091,Species,Animalia,20
4,pinniped,Pinnipedia,urn:lsid:marinespecies.org:taxname:148736,Infraorder,Animalia,1000
5,red grouper,Epinephelus morio,urn:lsid:marinespecies.org:taxname:159354,Species,Animalia,125
6,sei whale,Balaenoptera borealis,urn:lsid:marinespecies.org:taxname:137088,Species,Animalia,63
7,black grouper,Mycteroperca bonaci,urn:lsid:marinespecies.org:taxname:159231,Species,Animalia,125
8,humpback whale,Megaptera novaeangliae,urn:lsid:marinespecies.org:taxname:137092,Species,Animalia,300
9,killer whale,Orcinus orca,urn:lsid:marinespecies.org:taxname:137102,Species,Animalia,1000


Now let's add in the WoRMS mapping for species information.

In [6]:
# merge in the WoRMS species information
df_presence = df_final.merge(df_mapping, how='left', on='vernacularName')  

df_presence.sample(5)

Unnamed: 0,start_time,dolphin_presence,dataset_id,WKT,decimalLatitude,decimalLongitude,vernacularName,time,bluewhale_presence,bluewhale_manual_presence,...,humpbackwhale_presence,killerwhale_presence,minkewhale_presence,plainfinmidshipman_presence,northatlanticrightwhale_presence,scientificName,scientificNameID,taxonRank,kingdom,propagationFrequency
324893,NaT,,noaaSanctSound_CI04_08_bluewhale,POINT (33.8485 -120.1159),33.8485,-120.1159,blue whale,2021-10-08 08:24:51.480,1.0,,...,,,,,,Balaenoptera musculus,urn:lsid:marinespecies.org:taxname:137090,Species,Animalia,63
388404,NaT,,noaaSanctSound_MB01_03_bluewhale,POINT (36.798 -121.976),36.798,-121.976,blue whale,2019-11-16 17:29:12.576,1.0,,...,,,,,,Balaenoptera musculus,urn:lsid:marinespecies.org:taxname:137090,Species,Animalia,63
206128,NaT,,noaaSanctSound_CI04_05_bluewhale,POINT (33.8489 -120.1171),33.8489,-120.1171,blue whale,2020-08-06 06:44:04.040,1.0,,...,,,,,,Balaenoptera musculus,urn:lsid:marinespecies.org:taxname:137090,Species,Animalia,63
400416,NaT,,noaaSanctSound_MB01_03_bluewhale,POINT (36.798 -121.976),36.798,-121.976,blue whale,2019-12-19 16:32:50.072,1.0,,...,,,,,,Balaenoptera musculus,urn:lsid:marinespecies.org:taxname:137090,Species,Animalia,63
167845,NaT,,noaaSanctSound_CI04_03_bluewhale,POINT (33.84888 -120.117),33.84888,-120.117,blue whale,2019-09-27 11:43:21.720,1.0,,...,,,,,,Balaenoptera musculus,urn:lsid:marinespecies.org:taxname:137090,Species,Animalia,63
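
Because the merge is a left join, any `vernacularName` missing from the mapping table ends up with empty WoRMS columns. A minimal sketch of how to list such names, using toy stand-ins for `df_final` and `df_mapping`: the `minke whale` row is an assumed example of an unmapped name, and the frames `records` and `mapping` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for df_final and df_mapping (illustrative values only).
records = pd.DataFrame({"vernacularName": ["dolphin", "blue whale", "minke whale"]})
mapping = pd.DataFrame(
    {
        "vernacularName": ["dolphin", "blue whale"],
        "scientificName": ["Cetacea", "Balaenoptera musculus"],
    }
)

# indicator=True adds a `_merge` column saying where each row was found.
merged = records.merge(mapping, how="left", on="vernacularName", indicator=True)

# Rows marked "left_only" had no match in the mapping table.
unmatched = sorted(
    merged.loc[merged["_merge"] == "left_only", "vernacularName"].unique()
)
print(unmatched)
```

Running the same check against the real tables would show whether any species need to be added to the lookup file.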


In [7]:
df_presence.columns

Index(['start_time', 'dolphin_presence', 'dataset_id', 'WKT',
       'decimalLatitude', 'decimalLongitude', 'vernacularName', 'time',
       'bluewhale_presence', 'bluewhale_manual_presence', 'bocaccio_presence',
       'finwhale_presence', 'pinniped_presence', 'redgrouper_detection_count',
       'seiwhale_presence', 'atlanticcod_presence',
       'blackgrouper_detection_count', 'humpbackwhale_presence',
       'killerwhale_presence', 'minkewhale_presence',
       'plainfinmidshipman_presence', 'northatlanticrightwhale_presence',
       'scientificName', 'scientificNameID', 'taxonRank', 'kingdom',
       'propagationFrequency'],
      dtype='object')

## Determining `time`

Okay, we have two time variables: `start_time` and `time`.

We need to combine them into a single `eventDate`!

Let's first check whether the two columns can be merged safely.

First, let's print a sample of the rows where `time` has an entry:

In [8]:
df_presence.loc[df_presence['time'].notna(),['start_time','time']].sample(20)

Unnamed: 0,start_time,time
336825,NaT,2019-07-25 19:24:07.336000000
284062,NaT,2021-07-09 20:22:54.752000000
380509,NaT,2019-11-05 13:28:09.000000000
274011,NaT,2020-12-15 11:07:25.736000000
357635,NaT,2019-09-07 19:34:46.648000000
60865,NaT,2020-11-01 04:01:27.480000000
561094,NaT,2020-08-02 07:13:59.648000000
597312,NaT,2021-01-16 13:41:41.600000256
126411,NaT,2019-07-21 11:27:18.424000000
355014,NaT,2018-12-04 15:20:52.496000000


Okay, so let's see if `start_time` is NaT for all of those rows:

In [9]:
df_presence.loc[df_presence['time'].notna(),'start_time'].unique()

array(['NaT'], dtype='datetime64[ns]')

Looking good! Only NaT is returned, so there are no conflicting dates between `start_time` and `time`.

Fantastic! This means we can make a new `eventDate` column that takes its value from `time` where it exists and from `start_time` otherwise.
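
The check and the merge of the two columns can be sketched on toy data. This is a minimal illustration under the assumption that each row has exactly one of the two timestamps populated; the `toy` frame and its values are made up for the example.

```python
import pandas as pd

# Each row has either `start_time` or `time`, never both.
toy = pd.DataFrame(
    {
        "start_time": pd.to_datetime(["2018-12-15 04:00:00", None]),
        "time": pd.to_datetime([None, "2019-12-08 15:22:29.368"]),
    }
)

# Rows where both columns are populated would be ambiguous; count them.
conflicts = (toy["start_time"].notna() & toy["time"].notna()).sum()
print(f"{conflicts} conflicting rows")

# Prefer `time`, fall back to `start_time` where `time` is NaT.
toy["eventDate"] = toy["time"].fillna(toy["start_time"])
print(toy)
```

If `conflicts` were ever non-zero, the fallback order would decide which timestamp wins, so that case deserves a closer look before merging.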

In [10]:
# Build eventDate from `time` where it exists, otherwise fall back to `start_time`.
# Assigning the result avoids the chained `fillna(..., inplace=True)` pattern,
# which newer pandas versions warn about.
df_presence['eventDate'] = df_presence['time'].fillna(df_presence['start_time'])

df_presence[['eventDate','time','start_time']].sample(n=5)

Unnamed: 0,eventDate,time,start_time
534055,2018-11-28 01:29:19.000,2018-11-28 01:29:19.000,NaT
3738,2018-11-01 05:20:32.368,2018-11-01 05:20:32.368,NaT
248818,2020-10-13 14:25:34.496,2020-10-13 14:25:34.496,NaT
423838,2020-10-09 16:56:23.392,2020-10-09 16:56:23.392,NaT
570024,2020-09-12 09:38:16.800,2020-09-12 09:38:16.800,NaT


In [11]:
df_presence.loc[df_presence['eventDate'].isna()]

Unnamed: 0,start_time,dolphin_presence,dataset_id,WKT,decimalLatitude,decimalLongitude,vernacularName,time,bluewhale_presence,bluewhale_manual_presence,...,killerwhale_presence,minkewhale_presence,plainfinmidshipman_presence,northatlanticrightwhale_presence,scientificName,scientificNameID,taxonRank,kingdom,propagationFrequency,eventDate


## Double check we moved the right values

Show the rows where `time` is NaT and `start_time` was used.

In [12]:
df_presence.loc[df_presence['time'].isna(),['start_time','eventDate','time']].sample(5)

Unnamed: 0,start_time,eventDate,time
669263,2019-12-26 10:00:00.000,2019-12-26 10:00:00.000,NaT
676601,2019-08-09 19:00:00.000,2019-08-09 19:00:00.000,NaT
501325,2021-11-01 07:42:52.864,2021-11-01 07:42:52.864,NaT
686945,2020-10-24 13:00:00.000,2020-10-24 13:00:00.000,NaT
628795,2020-04-21 09:23:47.588,2020-04-21 09:23:47.588,NaT


Show the rows where `start_time` is NaT and `time` was used.

In [13]:
df_presence.loc[df_presence['start_time'].isna(),['start_time','eventDate','time']].sample(5)

Unnamed: 0,start_time,eventDate,time
179513,NaT,2019-10-27 09:29:26.232,2019-10-27 09:29:26.232
274753,NaT,2020-12-19 01:59:50.824,2020-12-19 01:59:50.824
7047,NaT,2019-07-10 04:23:29.624,2019-07-10 04:23:29.624
525758,NaT,2020-01-11 21:59:55.176,2020-01-11 21:59:55.176
157595,NaT,2019-08-26 14:57:15.632,2019-08-26 14:57:15.632


Now, let's make sure `eventDate` is stored as a datetime column so the dates are written out in a consistent format.

In [14]:
df_presence['eventDate'] = pd.to_datetime(df_presence['eventDate'], format='%Y-%m-%d %H:%M:%S.%f')

df_presence

Unnamed: 0,start_time,dolphin_presence,dataset_id,WKT,decimalLatitude,decimalLongitude,vernacularName,time,bluewhale_presence,bluewhale_manual_presence,...,killerwhale_presence,minkewhale_presence,plainfinmidshipman_presence,northatlanticrightwhale_presence,scientificName,scientificNameID,taxonRank,kingdom,propagationFrequency,eventDate
0,2018-12-15 04:00:00,1.0,noaaSanctSound_GR01_01_dolphins_1h,POINT (31.396417 -80.8904),31.396417,-80.8904,dolphin,NaT,,,...,,,,,Cetacea,urn:lsid:marinespecies.org:taxname:2688,Infraorder,Animalia,5000,2018-12-15 04:00:00
1,2018-12-15 05:00:00,1.0,noaaSanctSound_GR01_01_dolphins_1h,POINT (31.396417 -80.8904),31.396417,-80.8904,dolphin,NaT,,,...,,,,,Cetacea,urn:lsid:marinespecies.org:taxname:2688,Infraorder,Animalia,5000,2018-12-15 05:00:00
2,2018-12-15 06:00:00,1.0,noaaSanctSound_GR01_01_dolphins_1h,POINT (31.396417 -80.8904),31.396417,-80.8904,dolphin,NaT,,,...,,,,,Cetacea,urn:lsid:marinespecies.org:taxname:2688,Infraorder,Animalia,5000,2018-12-15 06:00:00
3,2018-12-15 07:00:00,1.0,noaaSanctSound_GR01_01_dolphins_1h,POINT (31.396417 -80.8904),31.396417,-80.8904,dolphin,NaT,,,...,,,,,Cetacea,urn:lsid:marinespecies.org:taxname:2688,Infraorder,Animalia,5000,2018-12-15 07:00:00
4,2018-12-15 18:00:00,1.0,noaaSanctSound_GR01_01_dolphins_1h,POINT (31.396417 -80.8904),31.396417,-80.8904,dolphin,NaT,,,...,,,,,Cetacea,urn:lsid:marinespecies.org:taxname:2688,Infraorder,Animalia,5000,2018-12-15 18:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
713671,2020-11-18 00:00:00,,noaaSanctSound_SB03_12_finwhale_1d,POINT (42.25508 -70.179047),42.25508,-70.179047,fin whale,NaT,,,...,,,,,Balaenoptera physalus,urn:lsid:marinespecies.org:taxname:137091,Species,Animalia,20,2020-11-18 00:00:00
713672,2020-11-19 00:00:00,,noaaSanctSound_SB03_12_finwhale_1d,POINT (42.25508 -70.179047),42.25508,-70.179047,fin whale,NaT,,,...,,,,,Balaenoptera physalus,urn:lsid:marinespecies.org:taxname:137091,Species,Animalia,20,2020-11-19 00:00:00
713673,2020-11-20 00:00:00,,noaaSanctSound_SB03_12_finwhale_1d,POINT (42.25508 -70.179047),42.25508,-70.179047,fin whale,NaT,,,...,,,,,Balaenoptera physalus,urn:lsid:marinespecies.org:taxname:137091,Species,Animalia,20,2020-11-20 00:00:00
713674,2020-11-21 00:00:00,,noaaSanctSound_SB03_12_finwhale_1d,POINT (42.25508 -70.179047),42.25508,-70.179047,fin whale,NaT,,,...,,,,,Balaenoptera physalus,urn:lsid:marinespecies.org:taxname:137091,Species,Animalia,20,2020-11-21 00:00:00


## Write presence file

In [15]:
# write the presence table to a zipped CSV file (overwrites any existing file)
fname = 'data/sanctsound_presence.zip'
df_presence.to_csv(fname, index=False, compression='zip')

df_presence.sample(n=5)

Unnamed: 0,start_time,dolphin_presence,dataset_id,WKT,decimalLatitude,decimalLongitude,vernacularName,time,bluewhale_presence,bluewhale_manual_presence,...,killerwhale_presence,minkewhale_presence,plainfinmidshipman_presence,northatlanticrightwhale_presence,scientificName,scientificNameID,taxonRank,kingdom,propagationFrequency,eventDate
521618,NaT,,noaaSanctSound_MB02_04_bluewhale,POINT (36.6495 -121.9084),36.6495,-121.9084,blue whale,2019-12-21 21:21:31.000,1.0,,...,,,,,Balaenoptera musculus,urn:lsid:marinespecies.org:taxname:137090,Species,Animalia,63,2019-12-21 21:21:31.000
333844,NaT,,noaaSanctSound_CI04_08_bluewhale,POINT (33.8485 -120.1159),33.8485,-120.1159,blue whale,2021-11-02 11:02:30.296,1.0,,...,,,,,Balaenoptera musculus,urn:lsid:marinespecies.org:taxname:137090,Species,Animalia,63,2021-11-02 11:02:30.296
420311,NaT,,noaaSanctSound_MB01_06_bluewhale,POINT (36.7977 -121.9757),36.7977,-121.9757,blue whale,2020-10-02 07:30:31.784,1.0,,...,,,,,Balaenoptera musculus,urn:lsid:marinespecies.org:taxname:137090,Species,Animalia,63,2020-10-02 07:30:31.784
590136,NaT,,noaaSanctSound_MB03_04_bluewhale,POINT (36.37021 -122.314903),36.37021,-122.314903,blue whale,2020-11-27 04:18:51.488,1.0,,...,,,,,Balaenoptera musculus,urn:lsid:marinespecies.org:taxname:137090,Species,Animalia,63,2020-11-27 04:18:51.488
2110,2021-01-29 02:00:00,1.0,noaaSanctSound_GR02_05_dolphins_1h,POINT (31.376133 -80.839133),31.376133,-80.839133,dolphin,NaT,,,...,,,,,Cetacea,urn:lsid:marinespecies.org:taxname:2688,Infraorder,Animalia,5000,2021-01-29 02:00:00.000
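
A downstream notebook needs to read this file back. A minimal sketch of the round trip, assuming the reader wants `eventDate` parsed as a datetime: it writes a one-row toy table to a temporary directory rather than to `data/`, and the `toy` and `roundtrip` names are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# One-row stand-in for df_presence (values copied from the output above).
toy = pd.DataFrame(
    {
        "dataset_id": ["noaaSanctSound_GR01_01_dolphins_1h"],
        "eventDate": pd.to_datetime(["2018-12-15 04:00:00"]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    fname = Path(tmp) / "sanctsound_presence.zip"

    # Same call shape as the cell above: a CSV inside a zip archive.
    toy.to_csv(fname, index=False, compression="zip")

    # parse_dates restores the datetime dtype that CSV text would otherwise lose.
    roundtrip = pd.read_csv(fname, compression="zip", parse_dates=["eventDate"])

print(roundtrip.dtypes)
```

Without `parse_dates`, `eventDate` would come back as plain strings, since CSV carries no type information.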
