<a href="https://colab.research.google.com/github/MathewBiddle/bio_data_guide/blob/main/datasets/caricoos_sargassum/CariCOOS_sargassum_biomass_2_DarwinCore.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CariCOOS Sargassum Biomass to Darwin Core

This notebook walks through the process of standardizing the CariCOOS Sargassum biomass data to the Darwin Core standard.

Source data:
* <http://dm3.caricoos.org/thredds/catalog/content/Parguera_Sargassum/Sargassum_Biomass/catalog.html>

In [1]:
import pandas as pd
import xarray as xr

## First, let's take a look at one of the datasets

We know the data are hosted on a THREDDS server as netCDF files, so we can use xarray to open them and inspect their contents.

In [2]:
urls = [
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_Varadero.nc',
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_San_Cristobal.nc',
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_Monsio_Jose_Entrada.nc',
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_Monsio_Jose_Centro.nc',
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_Monsio_Jose.nc',
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_Media_Luna_4.nc',
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_Media_Luna_2.nc',
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_Media_Luna.nc',
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_Laurel.nc',
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_Godo.nc',
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_Enrique.nc',
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_Corral_Oeste.nc',
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_Corral_Este.nc',
    'http://dm3.caricoos.org/thredds/fileServer/content/Parguera_Sargassum/Sargassum_Biomass/sargassum_biomass_Corral_Centro.nc'
]

In [3]:
ds = xr.open_mfdataset(urls, combine='nested', concat_dim='time', data_vars='all')

ds

Unnamed: 0,Array,Chunk
Bytes,10.71 kiB,0.96 kiB
Shape,"(2741,)","(246,)"
Dask graph,14 chunks in 29 graph layers,14 chunks in 29 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,10.71 kiB,0.96 kiB
Shape,"(2741,)","(246,)"
Dask graph,14 chunks in 29 graph layers,14 chunks in 29 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,21.41 kiB,1.92 kiB
Shape,"(2741,)","(246,)"
Dask graph,14 chunks in 29 graph layers,14 chunks in 29 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,21.41 kiB,1.92 kiB
Shape,"(2741,)","(246,)"
Dask graph,14 chunks in 29 graph layers,14 chunks in 29 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray


Let's make this a DataFrame that we can work with.

In [4]:
df = ds.to_dataframe()

df.sample(n=5)

time,station_id,latitude,longitude,Sargassum_biomass,Sargassum_biomass_Flag
2020-12-01,Media Luna 2,17.939409,-67.040291,,9.0
2023-08-18,Monsio Jose Entrada,17.967649,-67.076546,0.03125,2.0
2024-08-23,Corral Centro,17.943399,-67.005127,0.0,2.0
2021-04-21,Corral Este,17.944214,-67.002602,50.757812,2.0
2023-03-31,Enrique,17.954504,-67.050285,0.0,2.0


Okay, so we have a location and time and some information about biomass and a flag qualifier.

Let's see what the flags mean.

In [5]:
ds.Sargassum_biomass_Flag.attrs

{'Dimensions': 'time',
 'standard_name': 'quality_flag',
 'flag_values': array([0, 1, 2, 3, 4, 9], dtype=int32),
 'flag_meanings': 'calculated_data, not_analyzed, good_data, questtionable_data, bad_data, missing_data',
 'long_name': 'Quality flag for Sargassum_biomass',
 'coverage_content_type': 'qualityInformation'}

Cool! So a flag value of 9 means "missing_data". In my opinion, we should only work with records where `Sargassum_biomass_Flag` == 2 ("good_data").

In [6]:
df_occur = df.loc[df['Sargassum_biomass_Flag'] == 2]
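
To avoid reading flag meanings by eye, the `flag_values` and `flag_meanings` attributes can be zipped into a lookup table. The following is a minimal sketch; `decode_flags` and `example_attrs` are hypothetical names, and the attribute values are copied from the output above so the example runs without the THREDDS server. It assumes the meanings are comma-separated, as they are in this dataset.

```python
def decode_flags(attrs):
    """Map each numeric flag value to its meaning using the variable attributes."""
    values = [int(v) for v in attrs["flag_values"]]
    # This dataset separates meanings with commas; strip the padding around each.
    meanings = [m.strip() for m in attrs["flag_meanings"].split(",")]
    return dict(zip(values, meanings))


# Stand-in for ds.Sargassum_biomass_Flag.attrs, copied from the output above
# (spelling of the meanings is kept exactly as published by the provider).
example_attrs = {
    "flag_values": [0, 1, 2, 3, 4, 9],
    "flag_meanings": "calculated_data, not_analyzed, good_data, "
                     "questtionable_data, bad_data, missing_data",
}

flag_lookup = decode_flags(example_attrs)
print(flag_lookup[2])
print(flag_lookup[9])
```

In the notebook itself, the same function could be called with `ds.Sargassum_biomass_Flag.attrs` in place of `example_attrs`.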

So, what do we need for Darwin Core alignment?

Columns: 'occurrenceID','countryCode', 'kingdom', 'geodeticDatum','eventDate', 'decimalLongitude', 'decimalLatitude', 'scientificName', 'scientificNameID', 'occurrenceStatus', 'basisOfRecord'.

In [7]:
req_cols = ['occurrenceID','countryCode', 'kingdom', 'geodeticDatum','eventDate', 'decimalLongitude', 'decimalLatitude', 'scientificName', 'scientificNameID', 'occurrenceStatus', 'basisOfRecord']

missing_cols = [col for col in req_cols if col not in df_occur.columns]
for col in missing_cols:
    print('Column {} is missing.'.format(col))

Column occurrenceID is missing.
Column countryCode is missing.
Column kingdom is missing.
Column geodeticDatum is missing.
Column eventDate is missing.
Column decimalLongitude is missing.
Column decimalLatitude is missing.
Column scientificName is missing.
Column scientificNameID is missing.
Column occurrenceStatus is missing.
Column basisOfRecord is missing.
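
Because this check is repeated several times in the notebook, it could be wrapped in a small helper. This is only a sketch: `find_missing_columns` is a hypothetical name, and the tiny `example` frame stands in for `df_occur`.

```python
import pandas as pd


def find_missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]


# Small stand-in frame; the notebook would pass df_occur and req_cols instead.
example = pd.DataFrame(
    {"eventDate": ["2021-01-13"], "decimalLatitude": [17.972265]}
)
needed = ["occurrenceID", "eventDate", "decimalLatitude", "decimalLongitude"]

for col in find_missing_columns(example, needed):
    print(f"Column {col} is missing.")
```

Returning a list rather than only printing makes it possible to assert that the list is empty before writing the final files.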


Whoa, we're missing a bunch. Let's do some renaming and mapping.

In [8]:
df_occur = df_occur.reset_index()

df_occur.rename(
    columns={'latitude':'decimalLatitude',
             'longitude':'decimalLongitude',
             'time':'eventDate',
             },
    inplace=True,
)

df_occur

Unnamed: 0,eventDate,station_id,decimalLatitude,decimalLongitude,Sargassum_biomass,Sargassum_biomass_Flag
0,2021-01-13,Varadero,17.972265,-67.065926,0.000000,2.0
1,2021-02-04,Varadero,17.972265,-67.065926,0.000000,2.0
2,2021-02-17,Varadero,17.972265,-67.065926,0.000000,2.0
3,2021-03-04,Varadero,17.972265,-67.065926,0.000000,2.0
4,2021-03-19,Varadero,17.972265,-67.065926,0.000000,2.0
...,...,...,...,...,...,...
2163,2025-03-14,Corral Centro,17.943399,-67.005127,0.078125,2.0
2164,2025-03-21,Corral Centro,17.943399,-67.005127,0.000000,2.0
2165,2025-03-28,Corral Centro,17.943399,-67.005127,0.132812,2.0
2166,2025-04-04,Corral Centro,17.943399,-67.005127,0.062500,2.0


Let's check again to see what we are missing.

In [9]:
missing_cols = [col for col in req_cols if col not in df_occur.columns]
for col in missing_cols:
    print('Column {} is missing.'.format(col))

Column occurrenceID is missing.
Column countryCode is missing.
Column kingdom is missing.
Column geodeticDatum is missing.
Column scientificName is missing.
Column scientificNameID is missing.
Column occurrenceStatus is missing.
Column basisOfRecord is missing.


Okay, we're mainly missing the taxonomic information. Let's see how we can gather that and make a mapping table. Luckily, we are only working with Sargassum, so we can search WoRMS for "sargassum" and see what we find.

In [10]:
!pip install pyworms



In [11]:
import pyworms
worms_info = pyworms.aphiaRecordsByMatchNames('sargassum', marine_only=True)

worms_info

[[{'AphiaID': 144132,
   'url': 'https://www.marinespecies.org/aphia.php?p=taxdetails&id=144132',
   'scientificname': 'Sargassum',
   'authority': 'C.Agardh, 1820',
   'status': 'accepted',
   'unacceptreason': None,
   'taxonRankID': 180,
   'rank': 'Genus',
   'valid_AphiaID': 144132,
   'valid_name': 'Sargassum',
   'valid_authority': 'C.Agardh, 1820',
   'parentNameUsageID': 143725,
   'kingdom': 'Chromista',
   'phylum': 'Ochrophyta',
   'class': 'Phaeophyceae',
   'order': 'Fucales',
   'family': 'Sargassaceae',
   'genus': 'Sargassum',
   'citation': 'Guiry, M.D. & Guiry, G.M. (2025). AlgaeBase. World-wide electronic publication, National University of Ireland, Galway (taxonomic information republished from AlgaeBase with permission of M.D. Guiry). Sargassum C.Agardh, 1820. Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=144132 on 2025-05-14',
   'lsid': 'urn:lsid:marinespecies.org:taxname:144132',
   'isMarine': 1,

Awesome! Now we have our taxonomic information!! Let's add it to the data frame.

In [12]:
df_occur['kingdom'] = worms_info[0][0]['kingdom']
df_occur['scientificName'] = worms_info[0][0]['scientificname']
df_occur['scientificNameID'] = worms_info[0][0]['lsid']
df_occur.sample(n=5)

Unnamed: 0,eventDate,station_id,decimalLatitude,decimalLongitude,Sargassum_biomass,Sargassum_biomass_Flag,kingdom,scientificName,scientificNameID
1747,2024-05-17,Corral Oeste,17.943796,-67.009186,0.0,2.0,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132
2051,2022-06-24,Corral Centro,17.943399,-67.005127,1.789062,2.0,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132
1506,2023-02-17,Enrique,17.954504,-67.050285,0.0,2.0,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132
354,2025-04-04,San Cristobal,17.942074,-67.076714,0.0,2.0,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132
427,2024-02-16,Monsio Jose Entrada,17.967649,-67.076546,0.0,2.0,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132
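
Indexing `worms_info[0][0]` fails with an `IndexError` if WoRMS returns no match. A minimal sketch of a guarded version follows; `taxonomy_from_worms` is a hypothetical helper, and `sample_matches` is a trimmed copy of the response shown above so that the example runs offline.

```python
def taxonomy_from_worms(matches):
    """Pull Darwin Core taxonomy fields from the first match for the first name."""
    # The response is a list (one entry per queried name) of lists of records.
    if not matches or not matches[0]:
        raise ValueError("WoRMS returned no match for the requested name")
    record = matches[0][0]
    return {
        "kingdom": record["kingdom"],
        "scientificName": record["scientificname"],
        "scientificNameID": record["lsid"],
    }


# Trimmed copy of the record printed above, used here instead of a live query.
sample_matches = [[{
    "AphiaID": 144132,
    "scientificname": "Sargassum",
    "status": "accepted",
    "kingdom": "Chromista",
    "lsid": "urn:lsid:marinespecies.org:taxname:144132",
}]]

taxonomy = taxonomy_from_worms(sample_matches)
print(taxonomy)
```

The returned dictionary could then be assigned column by column, for example `df_occur[key] = value` for each item.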


Let's check again to see if we are missing anything.

In [13]:
missing_cols = [col for col in req_cols if col not in df_occur.columns]
for col in missing_cols:
    print('Column {} is missing.'.format(col))

Column occurrenceID is missing.
Column countryCode is missing.
Column geodeticDatum is missing.
Column occurrenceStatus is missing.
Column basisOfRecord is missing.


Let's build an eventID and occurrenceID.

In [14]:
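# Note: Series.replace(" ", "_") matches whole cell values only, so spaces
# inside station names are kept in eventID (see the output of this cell).
df_occur['eventID'] = df_occur['station_id'].replace(" ","_")+"_"+df_occur['eventDate'].dt.strftime('%Y-%m-%d')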


df_occur['occurrenceID'] = 'CARICOOS'+"_"+ds.project.replace(" ","_")+"_"+df_occur["scientificName"]+"_"+df_occur['eventID']

df_occur[['occurrenceID','eventID']].sample(n=5)

Unnamed: 0,occurrenceID,eventID
2008,CARICOOS_Sargassum_monitoring_program_Sargassu...,Corral Centro_2021-07-23
1564,CARICOOS_Sargassum_monitoring_program_Sargassu...,Enrique_2024-07-26
1325,CARICOOS_Sargassum_monitoring_program_Sargassu...,Godo_2023-03-17
1837,CARICOOS_Sargassum_monitoring_program_Sargassu...,Corral Este_2021-11-23
1425,CARICOOS_Sargassum_monitoring_program_Sargassu...,Enrique_2021-04-08
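
The eventID values above still contain spaces (for example `Corral Centro_2021-07-23`). If underscores are wanted throughout, the string accessor does substring replacement, whereas `Series.replace` compares entire values. A small sketch with made-up station names taken from this dataset:

```python
import pandas as pd

stations = pd.Series(["Corral Centro", "Enrique"])

# Series.replace looks for cells whose entire value equals " ": nothing changes.
whole_value = stations.replace(" ", "_")

# Series.str.replace substitutes every space inside each string.
substring = stations.str.replace(" ", "_", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```

Switching the notebook to `.str.replace` would change every eventID and occurrenceID, so it should be done before any identifiers are published.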


Now what are we missing?

In [15]:
missing_cols = [col for col in req_cols if col not in df_occur.columns]
for col in missing_cols:
    print('Column {} is missing.'.format(col))

Column countryCode is missing.
Column geodeticDatum is missing.
Column occurrenceStatus is missing.
Column basisOfRecord is missing.


Let's add the rest of them.

In [16]:
df_occur['countryCode'] = "US"
df_occur['geodeticDatum'] = "WGS84"
df_occur['occurrenceStatus'] = "present"
df_occur['basisOfRecord'] = "HumanObservation"

df_occur.sample(n=5)

Unnamed: 0,eventDate,station_id,decimalLatitude,decimalLongitude,Sargassum_biomass,Sargassum_biomass_Flag,kingdom,scientificName,scientificNameID,eventID,occurrenceID,countryCode,geodeticDatum,occurrenceStatus,basisOfRecord
266,2023-03-24,San Cristobal,17.942074,-67.076714,41.03125,2.0,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132,San Cristobal_2023-03-24,CARICOOS_Sargassum_monitoring_program_Sargassu...,US,WGS84,present,HumanObservation
2064,2022-10-20,Corral Centro,17.943399,-67.005127,0.0,2.0,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132,Corral Centro_2022-10-20,CARICOOS_Sargassum_monitoring_program_Sargassu...,US,WGS84,present,HumanObservation
1232,2025-02-14,Laurel,17.943192,-67.056442,0.0,2.0,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132,Laurel_2025-02-14,CARICOOS_Sargassum_monitoring_program_Sargassu...,US,WGS84,present,HumanObservation
965,2022-11-23,Media Luna,17.939505,-67.04287,0.0,2.0,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132,Media Luna_2022-11-23,CARICOOS_Sargassum_monitoring_program_Sargassu...,US,WGS84,present,HumanObservation
406,2023-07-21,Monsio Jose Entrada,17.967649,-67.076546,2.8125,2.0,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132,Monsio Jose Entrada_2023-07-21,CARICOOS_Sargassum_monitoring_program_Sargassu...,US,WGS84,present,HumanObservation
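
Several of the sampled rows have a biomass of 0.0 yet are marked "present". If a zero is meant to indicate that no Sargassum was found (an assumption that should be confirmed with the data provider), `occurrenceStatus` could be derived from the biomass instead of being set to a constant. A sketch on a made-up frame:

```python
import pandas as pd

# Made-up biomass values; the notebook would apply this to df_occur.
example = pd.DataFrame({"Sargassum_biomass": [0.0, 41.03125, 2.8125]})

# Assumption: biomass > 0 means present, exactly 0 means absent.
example["occurrenceStatus"] = (
    example["Sargassum_biomass"].gt(0).map({True: "present", False: "absent"})
)

print(example)
```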


One last check that all the required columns are present.

In [17]:
missing_cols = [col for col in req_cols if col not in df_occur.columns]
for col in missing_cols:
    print('Column {} is missing.'.format(col))

Awesome! We've got them all. Good work!

So, we now have an Occurrence Core table. But what about the biomass data? That's useful information! Let's map it to the Extended Measurement or Fact (eMoF) extension. But first, let's make an event table.

In [18]:
df_event = df_occur[['eventDate','decimalLatitude','decimalLongitude','eventID','geodeticDatum']]

df_event.sample(n=5)

Unnamed: 0,eventDate,decimalLatitude,decimalLongitude,eventID,geodeticDatum
1069,2021-04-27,17.943192,-67.056442,Laurel_2021-04-27,WGS84
1433,2021-06-04,17.954504,-67.050285,Enrique_2021-06-04,WGS84
529,2022-03-04,17.971815,-67.07235,Monsio Jose Centro_2022-03-04,WGS84
1303,2022-07-22,17.970369,-67.047806,Godo_2022-07-22,WGS84
1532,2023-09-01,17.954504,-67.050285,Enrique_2023-09-01,WGS84
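
An event table is normally expected to hold one row per eventID. Since `df_event` is sliced directly from the occurrence rows, it may be worth confirming that no station and date combination appears twice. A minimal sketch on made-up rows (`example_event` is a stand-in for `df_event`):

```python
import pandas as pd

# Made-up rows with one deliberate duplicate.
example_event = pd.DataFrame({
    "eventID": ["Enrique_2023-09-01", "Enrique_2023-09-01", "Godo_2022-07-22"],
    "eventDate": ["2023-09-01", "2023-09-01", "2022-07-22"],
})

has_unique_ids = example_event["eventID"].is_unique
print("eventID unique:", has_unique_ids)

# Keep the first row for each eventID if duplicates are found.
deduplicated = example_event.drop_duplicates(subset="eventID", keep="first")
print(deduplicated)
```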


Let's add a few additional details to the event table.

* 'coordinateUncertaintyInMeters'
* 'minimumDepthInMeters'
* 'maximumDepthInMeters'
* 'samplingProtocol'

In [19]:
pd.options.mode.copy_on_write = True

df_event['coordinateUncertaintyInMeters'] = 'unknown'
df_event['minimumDepthInMeters'] = 0
df_event['maximumDepthInMeters'] = 0
df_event['samplingProtocol'] = ds.Sargassum_biomass.comment

df_event.sample(n=5)

Unnamed: 0,eventDate,decimalLatitude,decimalLongitude,eventID,geodeticDatum,coordinateUncertaintyInMeters,minimumDepthInMeters,maximumDepthInMeters,samplingProtocol
311,2024-04-26,17.942074,-67.076714,San Cristobal_2024-04-26,WGS84,unknown,0,0,The weekly Sargassum biomass influx (kg∙m-1∙Wk...
112,2023-09-01,17.972265,-67.065926,Varadero_2023-09-01,WGS84,unknown,0,0,The weekly Sargassum biomass influx (kg∙m-1∙Wk...
93,2023-04-20,17.972265,-67.065926,Varadero_2023-04-20,WGS84,unknown,0,0,The weekly Sargassum biomass influx (kg∙m-1∙Wk...
2078,2023-03-24,17.943399,-67.005127,Corral Centro_2023-03-24,WGS84,unknown,0,0,The weekly Sargassum biomass influx (kg∙m-1∙Wk...
893,2021-04-21,17.939505,-67.04287,Media Luna_2021-04-21,WGS84,unknown,0,0,The weekly Sargassum biomass influx (kg∙m-1∙Wk...
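
Darwin Core defines `coordinateUncertaintyInMeters` as a numeric term and recommends leaving it empty when the uncertainty is not known, so the string 'unknown' may be rejected by validators. One possible alternative, sketched here on a made-up row, is to store a missing value, which pandas writes to CSV as an empty field:

```python
import pandas as pd

# Made-up single-row stand-in for df_event.
example_event = pd.DataFrame(
    {"eventID": ["Varadero_2023-09-01"], "geodeticDatum": ["WGS84"]}
)

# A missing value instead of a placeholder string.
example_event["coordinateUncertaintyInMeters"] = pd.NA

csv_text = example_event.to_csv(index=False)
print(csv_text)
```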


Okay, we have an event table and an occurrence table. The last step is to create an Extended Measurement or Fact (eMoF) extension table.

What do we need?

* eventID
* measurementDeterminedDate
* measurementType
* measurementValue
* measurementTypeID
* measurementUnit
* measurementUnitID
* measurementAccuracy
* measurementMethod


In [20]:
df_emof = df_occur[['eventID','eventDate','Sargassum_biomass','Sargassum_biomass_Flag']]

df_emof.rename(columns={
    'Sargassum_biomass':'measurementValue',
    'eventDate':'measurementDeterminedDate',
},
    inplace=True,
)

df_emof['measurementType'] = ds.Sargassum_biomass.long_name
df_emof['measurementUnit'] = ds.Sargassum_biomass.units
df_emof['measurementMethod'] = ds.Sargassum_biomass.method

df_emof.drop(columns='Sargassum_biomass_Flag', inplace=True)

df_emof.sample(n=5)

Unnamed: 0,eventID,measurementDeterminedDate,measurementValue,measurementType,measurementUnit,measurementMethod
712,Monsio Jose_2022-03-25,2022-03-25,0.0,Sargassum biomass concentration,kg m-1,Measured in the field
1212,Laurel_2024-09-20,2024-09-20,0.0,Sargassum biomass concentration,kg m-1,Measured in the field
1037,Media Luna_2024-09-27,2024-09-27,0.0,Sargassum biomass concentration,kg m-1,Measured in the field
1424,Enrique_2021-03-19,2021-03-19,0.0,Sargassum biomass concentration,kg m-1,Measured in the field
1526,Enrique_2023-07-21,2023-07-21,6.78125,Sargassum biomass concentration,kg m-1,Measured in the field


Now, let's hop over to the NERC Vocabulary Server to get identifiers for the measurement type and unit.

https://vocab.nerc.ac.uk/



In [21]:
df_emof['measurementTypeID'] = 'http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL05/'
df_emof['measurementUnitID'] = 'http://vocab.nerc.ac.uk/collection/S02/current/S040/'

df_emof.sample(n=5)

Unnamed: 0,eventID,measurementDeterminedDate,measurementValue,measurementType,measurementUnit,measurementMethod,measurementTypeID,measurementUnitID
189,San Cristobal_2021-06-04,2021-06-04,8.6875,Sargassum biomass concentration,kg m-1,Measured in the field,http://vocab.nerc.ac.uk/collection/P01/current...,http://vocab.nerc.ac.uk/collection/S02/current...
1269,Godo_2021-10-22,2021-10-22,0.0,Sargassum biomass concentration,kg m-1,Measured in the field,http://vocab.nerc.ac.uk/collection/P01/current...,http://vocab.nerc.ac.uk/collection/S02/current...
1493,Enrique_2022-09-09,2022-09-09,1.34375,Sargassum biomass concentration,kg m-1,Measured in the field,http://vocab.nerc.ac.uk/collection/P01/current...,http://vocab.nerc.ac.uk/collection/S02/current...
1020,Media Luna_2024-04-26,2024-04-26,0.0,Sargassum biomass concentration,kg m-1,Measured in the field,http://vocab.nerc.ac.uk/collection/P01/current...,http://vocab.nerc.ac.uk/collection/S02/current...
1802,Corral Este_2021-03-04,2021-03-04,5.242188,Sargassum biomass concentration,kg m-1,Measured in the field,http://vocab.nerc.ac.uk/collection/P01/current...,http://vocab.nerc.ac.uk/collection/S02/current...


We can clean up the occurrence records now.

In [22]:
df_occur.drop(columns=['Sargassum_biomass','Sargassum_biomass_Flag','station_id'],inplace=True)

df_occur.sample(n=5)

Unnamed: 0,eventDate,decimalLatitude,decimalLongitude,kingdom,scientificName,scientificNameID,eventID,occurrenceID,countryCode,geodeticDatum,occurrenceStatus,basisOfRecord
765,2023-07-14,17.968765,-67.076874,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132,Monsio Jose_2023-07-14,CARICOOS_Sargassum_monitoring_program_Sargassu...,US,WGS84,present,HumanObservation
1574,2024-10-11,17.954504,-67.050285,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132,Enrique_2024-10-11,CARICOOS_Sargassum_monitoring_program_Sargassu...,US,WGS84,present,HumanObservation
1226,2024-12-26,17.943192,-67.056442,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132,Laurel_2024-12-26,CARICOOS_Sargassum_monitoring_program_Sargassu...,US,WGS84,present,HumanObservation
1661,2022-03-25,17.943796,-67.009186,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132,Corral Oeste_2022-03-25,CARICOOS_Sargassum_monitoring_program_Sargassu...,US,WGS84,present,HumanObservation
2015,2021-09-01,17.943399,-67.005127,Chromista,Sargassum,urn:lsid:marinespecies.org:taxname:144132,Corral Centro_2021-09-01,CARICOOS_Sargassum_monitoring_program_Sargassu...,US,WGS84,present,HumanObservation


Now let's write them to CSV files to upload to the OBIS-USA IPT.

In [23]:

df_occur.to_csv('occur.csv', index=False)
df_event.to_csv('event.csv', index=False)
df_emof.to_csv('emof.csv',index=False)
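
Before uploading, it can help to confirm that the three tables still link together through `eventID`. The following is a sketch of such a check on made-up frames; in the notebook, `event` and `emof` would be `df_event` and `df_emof`.

```python
import pandas as pd

# Made-up tables; the last eMoF row deliberately has no matching event.
event = pd.DataFrame({"eventID": ["Laurel_2021-04-27", "Godo_2022-07-22"]})
emof = pd.DataFrame({
    "eventID": ["Laurel_2021-04-27", "Godo_2022-07-22", "Enrique_2023-09-01"],
    "measurementValue": [0.0, 1.5, 6.78125],
})

# Rows in the extension whose eventID is absent from the event table.
orphans = emof.loc[~emof["eventID"].isin(event["eventID"])]

print(f"{len(orphans)} eMoF row(s) without a matching event")
print(orphans)
```

An empty `orphans` frame means every measurement can be joined back to an event.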
