:::{note}
This notebook documents how to download the data from the ERDDAP on CIOOS Atlantic.

The outcome of this notebook is the file `metadata.csv`, which is already included in the repository.
:::

# Download the Dataset

We want to analyze the Centre for Marine Applied Research ([CMAR](https://cmar.ca/coastal-monitoring-program/#data)) Water Quality dataset.

<img src="https://cmar.ca/wp-content/themes/cmar/images/logo-cmar.png" width="30%">

<img src="https://cmar.ca/wp-content/uploads/sites/22/2023/12/Detailed-Version-Flipped-2-768x994.png" width="30%"/>

In [2]:
from erddapy import ERDDAP
import os
import pandas as pd
from tqdm.notebook import tqdm

The data is available from [CIOOS Atlantic](https://catalogue.cioosatlantic.ca/en/organization/cmar).

In [3]:
e = ERDDAP(
    server = "https://cioosatlantic.ca/erddap",
    protocol = "tabledap"
)

Determine the `datasetID` for each CMAR Water Quality dataset.

The study period is 2020-09-01 to 2024-08-31.

In [4]:
e.dataset_id = 'allDatasets'
e.variables = ['datasetID', 'institution', 'title', 'minTime', 'maxTime']

# only keep datasets whose time coverage overlaps the study period
e.constraints = {'maxTime>=': '2020-09-01', 'minTime<=': '2024-08-31'}
df_allDatasets = e.to_pandas()
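
The two constraints above together form an interval-overlap test: a dataset is kept when its coverage ends on or after the start of the study period and begins on or before its end. The following is a minimal standalone sketch of that logic; the helper `overlaps_study_period` and the example date ranges are hypothetical and only illustrate the comparison, not how ERDDAP evaluates it internally.

```python
from datetime import date

STUDY_START = date(2020, 9, 1)
STUDY_END = date(2024, 8, 31)

def overlaps_study_period(min_time: date, max_time: date) -> bool:
    """Return True if [min_time, max_time] shares at least one day with the study period."""
    # Mirrors the constraints 'maxTime>=' STUDY_START and 'minTime<=' STUDY_END.
    return max_time >= STUDY_START and min_time <= STUDY_END

# Hypothetical coverage ranges, for illustration only.
examples = {
    "ends before study": (date(2015, 1, 1), date(2019, 12, 31)),
    "spans study start": (date(2015, 11, 17), date(2021, 3, 1)),
    "inside study": (date(2021, 1, 1), date(2022, 1, 1)),
    "starts after study": (date(2024, 9, 1), date(2025, 1, 1)),
}

results = {name: overlaps_study_period(lo, hi) for name, (lo, hi) in examples.items()}
for name, keep in results.items():
    print(f"{name}: {keep}")
```

A dataset only needs to overlap the study period to pass this filter; it does not need to cover it completely.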

In [5]:
is_cmar = df_allDatasets['institution'].str.contains('CMAR')
is_water_quality = df_allDatasets['title'].str.contains('Water Quality Data')
df_CMAR_datasets = df_allDatasets[is_cmar & is_water_quality].copy()
df_CMAR_datasets['county'] = df_CMAR_datasets['title'].str.removesuffix(' County Water Quality Data')

df_CMAR_datasets.sample(5)

,datasetID,institution,title,minTime (UTC),maxTime (UTC),county
25,eda5-aubu,CMAR,Lunenburg County Water Quality Data,2015-11-17T17:00:00Z,2024-11-21T16:15:00Z,Lunenburg
10,gfri-gzxa,Centre for Marine Applied Research (CMAR),Colchester County Water Quality Data,2020-10-01T14:00:00Z,2023-06-07T13:16:25Z,Colchester
56,mq2k-54s4,CMAR,Shelburne County Water Quality Data,2018-02-13T16:32:00Z,2024-07-25T12:15:00Z,Shelburne
54,v6sa-tiit,Centre for Marine Applied Research (CMAR),Richmond County Water Quality Data,2015-11-26T21:00:00Z,2024-10-17T17:49:38Z,Richmond
16,eb3n-uxcb,Centre for Marine Applied Research (CMAR),Guysborough County Water Quality Data,2015-07-29T04:00:00Z,2024-11-07T14:59:10Z,Guysborough
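
The sample shows that the institution is recorded in two spellings, which is why the filter uses a substring match on `CMAR`. The sketch below reproduces the filter and the county extraction in plain Python. The helper `county_from_title` is hypothetical, as are the last two rows. Unlike `str.removesuffix`, which returns a non-matching title unchanged, this variant returns `None`, so a title that does not follow the expected pattern is easy to spot.

```python
from typing import Optional

SUFFIX = " County Water Quality Data"

def county_from_title(title: str) -> Optional[str]:
    """Return the county name, or None if the title lacks the expected suffix."""
    if not title.endswith(SUFFIX):
        return None
    return title[: -len(SUFFIX)]

rows = [
    # (institution, title); the first two appear in the sample output above
    ("CMAR", "Lunenburg County Water Quality Data"),
    ("Centre for Marine Applied Research (CMAR)", "Colchester County Water Quality Data"),
    # hypothetical rows, for illustration only
    ("CMAR", "Example Water Quality Data Summary"),
    ("Some Other Institution", "Example County Water Quality Data"),
]

selected = [
    (title, county_from_title(title))
    for institution, title in rows
    if "CMAR" in institution and "Water Quality Data" in title
]
for title, county in selected:
    print(f"{title!r} -> {county!r}")
```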


For each of these datasets, we download the temperature data locally.

In [6]:
e.variables = [
 'waterbody',
 'station',
 'deployment_start_date',
 'deployment_end_date',
 'time',
 'depth',
 'temperature',
 'qc_flag_temperature']

e.constraints = { "time>=": "2020-09-01", "time<=": "2024-08-31" }
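
For orientation, a tabledap request is an ordinary URL: the dataset ID and file type, then a comma-separated variable list, then one `&`-separated term per constraint. The sketch below assembles such a URL by hand. The helper `tabledap_url` is hypothetical, and the URL that erddapy generates may differ in details such as how times are encoded. The dataset ID `eda5-aubu` is the Lunenburg dataset from the sample above.

```python
from urllib.parse import quote

def tabledap_url(server, dataset_id, variables, constraints, file_type="csv"):
    """Hypothetical helper: assemble an ERDDAP tabledap query URL by hand."""
    # Each constraint is "<variable><operator><value>"; characters such as
    # '>' and '<' are percent-encoded so the URL is safe to send.
    terms = [quote(f"{key}{value}", safe="=:-") for key, value in constraints.items()]
    query = "&".join([",".join(variables)] + terms)
    return f"{server}/tabledap/{dataset_id}.{file_type}?{query}"

url = tabledap_url(
    "https://cioosatlantic.ca/erddap",
    "eda5-aubu",
    ["time", "depth", "temperature"],
    {"time>=": "2020-09-01", "time<=": "2024-08-31"},
)
print(url)
```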

This step takes 10-15 minutes, so we cache the data locally and download each dataset only once.

In [8]:
%%time

os.makedirs('data', exist_ok=True)

for index, row in df_CMAR_datasets.iterrows():

    csvfile = f"data/{row['county']}.csv.gz"

    if os.path.exists(csvfile):
        continue

    print(f"Downloading {row['title']}...")
    e.dataset_id = row['datasetID']
    df = e.to_pandas()

    df.to_csv(csvfile, compression='gzip', index=False)

CPU times: user 1.81 ms, sys: 30 μs, total: 1.84 ms
Wall time: 1.53 ms
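
The millisecond wall time above is what a fully cached run looks like: every file already exists, so nothing is downloaded. One weakness of an existence check is that an interrupted download can leave a partial file that is later treated as complete. The sketch below shows one possible way to guard against this by writing to a temporary name and renaming at the end. The helper `fetch_cached`, the `.part` suffix, and the fake download function are all assumptions for illustration and are not part of the notebook.

```python
import gzip
import os
import tempfile

def fetch_cached(path, fetch):
    """Write fetch() to a gzip file at `path` unless it already exists.

    Returns True if fetch() was called, False on a cache hit.
    """
    if os.path.exists(path):
        return False
    text = fetch()
    tmp_path = path + ".part"  # hypothetical temporary name
    with gzip.open(tmp_path, "wt", encoding="utf-8") as f:
        f.write(text)
    # Rename only after the write succeeded, so `path` never holds partial data.
    os.replace(tmp_path, path)
    return True

calls = []

def fake_download():
    """Stand-in for the real download; records each call."""
    calls.append(1)
    return "time,temperature\n2020-09-01T00:00:00Z,15.2\n"

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "Example.csv.gz")
    first = fetch_cached(target, fake_download)   # downloads
    second = fetch_cached(target, fake_download)  # cache hit
    with gzip.open(target, "rt", encoding="utf-8") as f:
        cached_text = f.read()

print(first, second, len(calls))
```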


We now have the following `.csv.gz` files stored locally:

In [9]:
!ls -lh data/

total 107M
-rw-r--r-- 1 jovyan jovyan 1.4M Jul 31 23:58 Annapolis.csv.gz
-rw-r--r-- 1 jovyan jovyan 4.9M Jul 31 23:58 Antigonish.csv.gz
-rw-r--r-- 1 jovyan jovyan 1.3M Jul 31 23:59 Colchester.csv.gz
-rw-r--r-- 1 jovyan jovyan 5.7M Jul 31 23:59 Digby.csv.gz
-rw-r--r-- 1 jovyan jovyan  45M Aug  1 00:00 Guysborough.csv.gz
-rw-r--r-- 1 jovyan jovyan 7.9M Aug  1 00:00 Halifax.csv.gz
-rw-r--r-- 1 jovyan jovyan 1.5M Aug  1 00:01 Inverness.csv.gz
-rw-r--r-- 1 jovyan jovyan  17M Aug  1 00:05 Lunenburg.csv.gz
-rw-r--r-- 1 jovyan jovyan 2.0M Aug  1 00:05 Pictou.csv.gz
-rw-r--r-- 1 jovyan jovyan 2.9M Aug  1 00:06 Queens.csv.gz
-rw-r--r-- 1 jovyan jovyan 3.5M Aug  1 00:08 Richmond.csv.gz
-rw-r--r-- 1 jovyan jovyan 6.4M Aug  1 00:09 Shelburne.csv.gz
-rw-r--r-- 1 jovyan jovyan 8.1K Aug  1 00:09 Victoria.csv.gz
-rw-r--r-- 1 jovyan jovyan 8.3M Aug  1 00:12 Yarmouth.csv.gz


We need to organize and sort the observations so that each time series contains only the observations from a single sensor, in temporal order.

This also removes the duplicated metadata within these `.csv` files.

In [10]:
os.makedirs('segments', exist_ok=True)

all_segment_metadata = []
for index, row in tqdm(list(df_CMAR_datasets.iterrows())):

    csvfile = f"data/{row['county']}.csv.gz"

    df = pd.read_csv(csvfile)
    
    df['segment'] = df[['waterbody', 'station', 'depth (m)',
                     'deployment_start_date (UTC)', 'deployment_end_date (UTC)',
                     ]].agg(lambda x: row['county'] + '_' + '_'.join([str(y) for y in x]), axis=1)

    df_metadata = df[['segment', 'waterbody', 'station', 'depth (m)',
                     'deployment_start_date (UTC)', 'deployment_end_date (UTC)',
                     ]]

    df_metadata = df_metadata.drop_duplicates()
    all_segment_metadata.append(df_metadata)
    
    df_data = df.drop(columns=['waterbody', 'station', 'depth (m)',
                                 'deployment_start_date (UTC)', 'deployment_end_date (UTC)',
                              ])
    
    df_data = df_data.sort_values(by=['segment', 'time (UTC)'])

    df_data.set_index(['segment', 'time (UTC)'], inplace=True)

    for key, segment_df in df_data.groupby(level=0):
        csvfile = f'segments/{key}.csv'
        segment_df = segment_df.droplevel(0)
        segment_df.to_csv(csvfile)

df_metadata = pd.concat(all_segment_metadata)
df_metadata.set_index('segment', inplace=True)
df_metadata.to_csv('metadata.csv')

  0%|          | 0/14 [00:00<?, ?it/s]
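
To make the segment key concrete, the sketch below builds the same kind of key from plain dictionaries and groups a few readings by it. The helpers `segment_key` and `make_record`, the `temperature` field name, and the temperature values are hypothetical. The station names and deployment dates are borrowed from the Annapolis rows of the metadata table.

```python
from itertools import groupby

KEY_COLUMNS = ["waterbody", "station", "depth (m)",
               "deployment_start_date (UTC)", "deployment_end_date (UTC)"]

def segment_key(county, record):
    """Join the county and the identifying columns into one string key."""
    return county + "_" + "_".join(str(record[c]) for c in KEY_COLUMNS)

def make_record(station, depth, time, temperature):
    """Build one hypothetical observation for a single deployment."""
    return {
        "waterbody": "Annapolis Basin", "station": station, "depth (m)": depth,
        "deployment_start_date (UTC)": "2020-06-11T00:00:00Z",
        "deployment_end_date (UTC)": "2020-11-22T00:00:00Z",
        "time (UTC)": time, "temperature": temperature,
    }

records = [
    make_record("Lobster Ledge", 2.0, "2020-09-01T01:00:00Z", 15.4),
    make_record("Cornwallis", 2.0, "2020-09-01T02:00:00Z", 15.0),
    make_record("Cornwallis", 2.0, "2020-09-01T01:00:00Z", 15.1),
]

# Sort by (segment, time); ISO 8601 strings in one format sort chronologically.
keyed = sorted(
    (segment_key("Annapolis", r), r["time (UTC)"], r["temperature"]) for r in records
)

# groupby needs its input sorted by the grouping key, which the sort above ensures.
segments = {
    key: [(time, temp) for _, time, temp in group]
    for key, group in groupby(keyed, key=lambda item: item[0])
}

for key, series in segments.items():
    print(key, series)
```

Each key doubles as a file name in `segments/`, so a key is only usable as long as none of its parts contains a path separator.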

In [11]:
!ls -lh segments/ | wc

   1082   12409  145901


We have 852 distinct observational time series taken at various locations and depths around Nova Scotia between 2020-09-01 and 2024-08-31.
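
Note that `ls -lh` prints a leading `total` line, and that `segments/` may still hold files from earlier runs, so the line count from `wc` is not necessarily the number of current segments. A more direct check compares the segment names in `metadata.csv` with the files on disk. The sketch below does this in a temporary directory with made-up segment names; the helper `check_segments` is hypothetical.

```python
import csv
import tempfile
from pathlib import Path

def check_segments(metadata_csv, segments_dir):
    """Return (missing, extra): segments without a file, and files without a segment."""
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        expected = {row["segment"] for row in csv.DictReader(f)}
    # Path.stem drops only the final ".csv", so keys containing "2.0" survive intact.
    found = {p.stem for p in Path(segments_dir).glob("*.csv")}
    return expected - found, found - expected

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    seg_dir = tmp / "segments"
    seg_dir.mkdir()

    # Made-up metadata listing two segments.
    (tmp / "metadata.csv").write_text("segment,station\nA_1.0,X\nB_2.0,Y\n", encoding="utf-8")

    # One matching file and one stale file left over from an earlier run.
    (seg_dir / "A_1.0.csv").write_text("time (UTC),temperature\n", encoding="utf-8")
    (seg_dir / "stale_3.0.csv").write_text("time (UTC),temperature\n", encoding="utf-8")

    missing, extra = check_segments(tmp / "metadata.csv", seg_dir)

print("missing:", missing)
print("extra:", extra)
```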

In [12]:
df_metadata.head(8)

segment,waterbody,station,depth (m),deployment_start_date (UTC),deployment_end_date (UTC)
Annapolis_Annapolis Basin_Cornwallis_2.0_2020-06-11T00:00:00Z_2020-11-22T00:00:00Z,Annapolis Basin,Cornwallis,2.0,2020-06-11T00:00:00Z,2020-11-22T00:00:00Z
Annapolis_Annapolis Basin_Lobster Ledge_2.0_2020-06-11T00:00:00Z_2020-11-22T00:00:00Z,Annapolis Basin,Lobster Ledge,2.0,2020-06-11T00:00:00Z,2020-11-22T00:00:00Z
Annapolis_Annapolis Basin_Lobster Ledge_6.0_2020-06-11T00:00:00Z_2020-11-22T00:00:00Z,Annapolis Basin,Lobster Ledge,6.0,2020-06-11T00:00:00Z,2020-11-22T00:00:00Z
Annapolis_Annapolis Basin_Lobster Ledge_4.0_2020-06-11T00:00:00Z_2020-11-22T00:00:00Z,Annapolis Basin,Lobster Ledge,4.0,2020-06-11T00:00:00Z,2020-11-22T00:00:00Z
Annapolis_Annapolis Basin_Cornwallis_1.0_2020-06-11T00:00:00Z_2020-11-22T00:00:00Z,Annapolis Basin,Cornwallis,1.0,2020-06-11T00:00:00Z,2020-11-22T00:00:00Z
Annapolis_Annapolis Basin_Cornwallis_1.0_2020-11-22T00:00:00Z_2021-06-16T00:00:00Z,Annapolis Basin,Cornwallis,1.0,2020-11-22T00:00:00Z,2021-06-16T00:00:00Z
Annapolis_Annapolis Basin_Cornwallis_2.0_2020-11-22T00:00:00Z_2021-06-16T00:00:00Z,Annapolis Basin,Cornwallis,2.0,2020-11-22T00:00:00Z,2021-06-16T00:00:00Z
Annapolis_Annapolis Basin_Lobster Ledge_6.0_2020-11-22T00:00:00Z_2021-06-16T00:00:00Z,Annapolis Basin,Lobster Ledge,6.0,2020-11-22T00:00:00Z,2021-06-16T00:00:00Z
