# Download the Dataset

We want to analyze the Centre for Marine Applied Research ([CMAR](https://cmar.ca/coastal-monitoring-program/#data)) Water Quality dataset.

<img src="https://cmar.ca/wp-content/themes/cmar/images/logo-cmar.png" width="50%">

<img src="https://cmar.ca/wp-content/uploads/sites/22/2023/12/Detailed-Version-Flipped-2-768x994.png" width="50%"/>

In [1]:
from erddapy import ERDDAP
import os
import pandas as pd
from tqdm.notebook import tqdm

The data is available from [CIOOS Atlantic](https://catalogue.cioosatlantic.ca/en/organization/cmar).

In [2]:
e = ERDDAP(
    server = "https://cioosatlantic.ca/erddap",
    protocol = "tabledap"
)

Determine the `datasetID` for each CMAR Water Quality dataset.

The study period is 2020-09-01 to 2024-08-31.

In [3]:
e.dataset_id = 'allDatasets'
e.variables = ['datasetID', 'institution', 'title', 'minTime', 'maxTime']

# only keep datasets whose time range overlaps the study period
e.constraints = {'maxTime>=': '2020-09-01', 'minTime<=': '2024-08-31'}
df_allDatasets = e.to_pandas()
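
The two constraints above amount to an interval-overlap test: a dataset is kept when its last observation is on or after the start of the study period and its first observation is on or before the end. A minimal standard-library sketch of that logic follows; the helper name `overlaps_study_period` is hypothetical, and the second example uses made-up dates.

```python
from datetime import date

STUDY_START = date(2020, 9, 1)
STUDY_END = date(2024, 8, 31)

def overlaps_study_period(min_time, max_time, start=STUDY_START, end=STUDY_END):
    """Return True if [min_time, max_time] shares at least one day with [start, end]."""
    # Mirrors the ERDDAP constraints 'maxTime>=' start and 'minTime<=' end.
    return max_time >= start and min_time <= end

# Annapolis County: observations from 2020-06-11 to 2023-05-24 (from the catalogue).
annapolis = overlaps_study_period(date(2020, 6, 11), date(2023, 5, 24))

# Made-up dataset that ended before the study period began.
too_early = overlaps_study_period(date(2015, 1, 1), date(2019, 12, 31))

print(annapolis, too_early)
```

Note that a dataset only needs to overlap the study period, not cover it entirely, so some counties contribute only part of the four years.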

In [4]:
df_CMAR_datasets = df_allDatasets[df_allDatasets['institution'].str.contains('CMAR') & df_allDatasets['title'].str.contains('Water Quality Data')].copy()
df_CMAR_datasets['county'] = df_CMAR_datasets['title'].str.removesuffix(' County Water Quality Data')

df_CMAR_datasets.sample(5)

| | datasetID | institution | title | minTime (UTC) | maxTime (UTC) | county |
|---|---|---|---|---|---|---|
| 2 | knwz-4bap | Centre for Marine Applied Research (CMAR) | Annapolis County Water Quality Data | 2020-06-11T19:15:00Z | 2023-05-24T17:23:45Z | Annapolis |
| 3 | kgdu-nqdp | Centre for Marine Applied Research (CMAR) | Antigonish County Water Quality Data | 2018-07-24T22:15:00Z | 2024-10-18T17:35:42Z | Antigonish |
| 10 | gfri-gzxa | Centre for Marine Applied Research (CMAR) | Colchester County Water Quality Data | 2020-10-01T14:00:00Z | 2023-06-07T13:16:25Z | Colchester |
| 12 | wpsu-7fer | Centre for Marine Applied Research (CMAR) | Digby County Water Quality Data | 2016-01-21T20:00:00Z | 2024-06-04T12:07:18Z | Digby |
| 16 | eb3n-uxcb | Centre for Marine Applied Research (CMAR) | Guysborough County Water Quality Data | 2015-07-29T04:00:00Z | 2024-11-07T14:59:10Z | Guysborough |
| 18 | x9dy-aai9 | Centre for Marine Applied Research (CMAR) | Halifax County Water Quality Data | 2017-07-24T16:45:00Z | 2024-11-21T18:12:15Z | Halifax |
| 24 | a9za-3t63 | Centre for Marine Applied Research (CMAR) | Inverness County Water Quality Data | 2015-11-26T21:20:51Z | 2022-10-24T12:45:00Z | Inverness |
| 25 | eda5-aubu | CMAR | Lunenburg County Water Quality Data | 2015-11-17T17:00:00Z | 2024-11-21T16:15:00Z | Lunenburg |
| 47 | adpu-nyt8 | Centre for Marine Applied Research (CMAR) | Pictou County Water Quality Data | 2017-07-25T20:30:00Z | 2024-10-18T14:00:00Z | Pictou |
| 53 | qspp-qhb6 | Centre for Marine Applied Research (CMAR) | Queens County Water Quality Data | 2020-06-25T21:21:58Z | 2024-07-22T19:30:00Z | Queens |
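
The filter matches `'CMAR'` as a substring rather than comparing the full institution name, which matters because the Lunenburg entry lists its institution simply as `CMAR`. The following plain-Python sketch shows the same filter and the county extraction; the third record is made up for illustration, and `str.removesuffix` assumes Python 3.9 or later.

```python
SUFFIX = " County Water Quality Data"

records = [
    {"institution": "Centre for Marine Applied Research (CMAR)",
     "title": "Annapolis County Water Quality Data"},
    {"institution": "CMAR",
     "title": "Lunenburg County Water Quality Data"},
    # Made-up record from another institution; it should be filtered out.
    {"institution": "Some Other Institution",
     "title": "Hypothetical Harbour Water Quality Data"},
]

def cmar_counties(rows):
    """Return county names for CMAR water quality datasets."""
    counties = []
    for r in rows:
        if "CMAR" in r["institution"] and "Water Quality Data" in r["title"]:
            # removesuffix leaves the title unchanged if the suffix is absent.
            counties.append(r["title"].removesuffix(SUFFIX))
    return counties

counties = cmar_counties(records)
print(counties)  # ['Annapolis', 'Lunenburg']
```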


For each of these datasets, we download the temperature data locally.

In [5]:
e.variables = [
 'waterbody',
 'station',
# 'sensor_type',
# 'sensor_serial_number',
# 'rowSize',
# 'lease',
# 'latitude',
# 'longitude',
 'deployment_start_date',
 'deployment_end_date',
# 'string_configuration',
 'time',
 'depth',
# 'depth_crosscheck_flag',
# 'dissolved_oxygen',
# 'salinity',
# 'sensor_depth_measured',
 'temperature',
# 'qc_flag_dissolved_oxygen',
# 'qc_flag_salinity',
# 'qc_flag_sensor_depth_measured',
 'qc_flag_temperature']

e.constraints = { "time>=": "2020-09-01", "time<=": "2024-08-31" }
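
For orientation, a tabledap request is an ordinary URL: the variable list and the constraints become the query string. The sketch below builds such a URL by hand. `build_tabledap_url` is a hypothetical helper written for illustration and is not how `erddapy` is implemented; in practice `e.get_download_url()` returns the URL that will actually be requested.

```python
from urllib.parse import quote

def build_tabledap_url(server, dataset_id, variables, constraints, response="csv"):
    """Assemble a tabledap URL (illustrative only, not erddapy's own code)."""
    query = ",".join(variables)
    for key, value in constraints.items():
        # Percent-encode '<' and '>' but keep '=' readable.
        query += "&" + quote(f"{key}{value}", safe="=")
    return f"{server}/tabledap/{dataset_id}.{response}?{query}"

url = build_tabledap_url(
    "https://cioosatlantic.ca/erddap",
    "knwz-4bap",  # Annapolis County dataset ID from the table above
    ["time", "temperature"],
    {"time>=": "2020-09-01", "time<=": "2024-08-31"},
)
print(url)
```

Requesting only the columns that are needed keeps the downloads smaller, which is why most variables in the list above are commented out.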

The download takes a few minutes, so we cache each county's data locally and only download files that do not already exist.

In [6]:
%%time

os.makedirs('data', exist_ok=True)

for index, row in df_CMAR_datasets.iterrows():

    csvfile = f"data/{row['county']}.csv"

    if os.path.exists(csvfile):
        continue

    print(f"Downloading {row['title']}...")
    e.dataset_id = row['datasetID']
    df = e.to_pandas()

    df.to_csv(csvfile, index=False)

CPU times: user 2 ms, sys: 128 μs, total: 2.13 ms
Wall time: 5.5 ms
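
The loop above is a download-if-missing cache: a county is skipped whenever its file already exists, which is presumably why the timing shown is only a few milliseconds. One weakness of checking only for existence is that a write interrupted partway through would leave a partial file that is later treated as complete. A minimal sketch of a more defensive variant follows, which writes to a temporary name and renames on success. The helper `cached_download`, the `.part` suffix, and the sample CSV content are all assumptions made for illustration.

```python
import os
import tempfile

def cached_download(path, fetch):
    """Write fetch() to path unless it already exists. Return True if fetched."""
    if os.path.exists(path):
        return False
    tmp_path = path + ".part"
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(fetch())
    # The final name only appears once the file has been fully written.
    os.replace(tmp_path, path)
    return True

calls = []

def fake_fetch():
    """Stand-in for the real download; returns made-up CSV text."""
    calls.append(1)
    return "time,temperature\n2020-09-01T00:00:00Z,15.2\n"

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "Example.csv")
    first = cached_download(target, fake_fetch)   # downloads
    second = cached_download(target, fake_fetch)  # served from cache

print(first, second, len(calls))  # True False 1
```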


We now have the following `.csv` files stored locally:

In [7]:
!ls -lh data/

total 1.8G
-rw-r--r-- 1 jmunroe jmunroe  32M Jun 19 14:30 Annapolis.csv
-rw-r--r-- 1 jmunroe jmunroe 101M Jun 19 14:32 Antigonish.csv
-rw-r--r-- 1 jmunroe jmunroe  28M Jun 19 14:32 Colchester.csv
-rw-r--r-- 1 jmunroe jmunroe 122M Jun 19 14:33 Digby.csv
-rw-r--r-- 1 jmunroe jmunroe 707M Jun 19 14:37 Guysborough.csv
-rw-r--r-- 1 jmunroe jmunroe 141M Jun 19 14:38 Halifax.csv
-rw-r--r-- 1 jmunroe jmunroe  52M Jun 19 14:38 Inverness.csv
-rw-r--r-- 1 jmunroe jmunroe 198M Jun 19 14:40 Lunenburg.csv
-rw-r--r-- 1 jmunroe jmunroe  44M Jun 19 14:41 Pictou.csv
-rw-r--r-- 1 jmunroe jmunroe  50M Jun 19 14:42 Queens.csv
-rw-r--r-- 1 jmunroe jmunroe  96M Jun 19 14:42 Richmond.csv
-rw-r--r-- 1 jmunroe jmunroe 103M Jun 19 14:42 Shelburne.csv
-rw-r--r-- 1 jmunroe jmunroe 183K Jun 19 14:42 Victoria.csv
-rw-r--r-- 1 jmunroe jmunroe 161M Jun 19 14:44 Yarmouth.csv


We need to organize and sort the observations so that each time series contains only the observations from a single sensor, in temporal order.

This also removes the metadata that is duplicated on every row of these `.csv` files.

In [14]:
os.makedirs('segments', exist_ok=True)

all_segment_metadata = []
for index, row in tqdm(list(df_CMAR_datasets.iterrows())):

    csvfile = f"data/{row['county']}.csv"

    df = pd.read_csv(csvfile)
    
    df['segment'] = df[['waterbody', 'station', 'depth (m)',
                     'deployment_start_date (UTC)', 'deployment_end_date (UTC)',
                     ]].agg(lambda x: row['county'] + '_' + '_'.join([str(y) for y in x]), axis=1)

    df_metadata = df[['segment', 'waterbody', 'station', 'depth (m)',
                     'deployment_start_date (UTC)', 'deployment_end_date (UTC)',
                     ]]

    df_metadata = df_metadata.drop_duplicates()
    all_segment_metadata.append(df_metadata)
    
    df_data = df.drop(columns=['waterbody', 'station', 'depth (m)',
                                 'deployment_start_date (UTC)', 'deployment_end_date (UTC)',
                              ])
    
    df_data = df_data.sort_values(by=['segment', 'time (UTC)'])

    df_data.set_index(['segment', 'time (UTC)'], inplace=True)

    for key, segment_df in df_data.groupby(level=0):
        csvfile = f'segments/{key}.csv'
        segment_df = segment_df.droplevel(0)
        segment_df.to_csv(csvfile)

df_metadata = pd.concat(all_segment_metadata)
df_metadata.set_index('segment', inplace=True)
df_metadata.to_csv('metadata.csv')

  0%|          | 0/14 [00:00<?, ?it/s]
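
The segmentation step can be summarized as: build a key from the county, waterbody, station, depth and deployment dates, group rows by that key, and sort each group by time. The sketch below shows the same idea with plain dictionaries instead of pandas. The field names are shortened, the temperatures are made up, and the helpers `segment_key` and `split_into_segments` are hypothetical.

```python
from collections import defaultdict

KEY_FIELDS = ["waterbody", "station", "depth", "deployment_start", "deployment_end"]

def segment_key(county, row):
    """Join the identifying fields into one string, prefixed by the county."""
    return county + "_" + "_".join(str(row[f]) for f in KEY_FIELDS)

def split_into_segments(county, rows):
    """Group (time, temperature) pairs by segment key, sorted by time."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_key(county, row)].append((row["time"], row["temperature"]))
    # ISO 8601 UTC timestamps in one format sort chronologically as strings.
    return {key: sorted(values) for key, values in segments.items()}

base = {"waterbody": "Annapolis Basin", "station": "Cornwallis",
        "deployment_start": "2020-06-11T00:00:00Z",
        "deployment_end": "2020-11-22T00:00:00Z"}
rows = [  # temperatures are made-up values
    {**base, "depth": 2.0, "time": "2020-09-01T01:00:00Z", "temperature": 15.1},
    {**base, "depth": 2.0, "time": "2020-09-01T00:00:00Z", "temperature": 15.3},
    {**base, "depth": 1.0, "time": "2020-09-01T00:00:00Z", "temperature": 15.8},
]

segments = split_into_segments("Annapolis", rows)
for key, series in segments.items():
    print(key, series)
```

Because the key doubles as a file name, any waterbody or station name containing a path separator would need sanitizing before it is written to disk.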

In [15]:
!ls -lh segments/ | wc

    852    9728  116256


`wc` reports 852 lines, but `ls -l` also prints a `total` summary line, so we have 851 distinct observational time series taken at various locations and depths around Nova Scotia during the period 2020-09-01 to 2024-08-31.
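
Piping `ls -l` into `wc` counts every line of output, including the `total` header, so a direct count of directory entries is a useful cross-check. A minimal sketch follows, using a temporary directory with made-up file names; the helper `count_csv_files` is hypothetical.

```python
import os
import tempfile

def count_csv_files(directory):
    """Count regular files ending in .csv directly inside directory."""
    with os.scandir(directory) as entries:
        return sum(1 for e in entries if e.is_file() and e.name.endswith(".csv"))

with tempfile.TemporaryDirectory() as d:
    # Two CSV files and one unrelated file, all empty and made up.
    for name in ("a.csv", "b.csv", "notes.txt"):
        open(os.path.join(d, name), "w").close()
    n = count_csv_files(d)

print(n)  # 2
```

In the notebook, `count_csv_files('segments')` would give the number of segment files without relying on shell output.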

In [27]:
df_metadata.head(8)

| segment | waterbody | station | depth (m) | deployment_start_date (UTC) | deployment_end_date (UTC) |
|---|---|---|---|---|---|
| Annapolis_Annapolis Basin_Cornwallis_2.0_2020-06-11T00:00:00Z_2020-11-22T00:00:00Z | Annapolis Basin | Cornwallis | 2.0 | 2020-06-11T00:00:00Z | 2020-11-22T00:00:00Z |
| Annapolis_Annapolis Basin_Lobster Ledge_2.0_2020-06-11T00:00:00Z_2020-11-22T00:00:00Z | Annapolis Basin | Lobster Ledge | 2.0 | 2020-06-11T00:00:00Z | 2020-11-22T00:00:00Z |
| Annapolis_Annapolis Basin_Lobster Ledge_4.0_2020-06-11T00:00:00Z_2020-11-22T00:00:00Z | Annapolis Basin | Lobster Ledge | 4.0 | 2020-06-11T00:00:00Z | 2020-11-22T00:00:00Z |
| Annapolis_Annapolis Basin_Cornwallis_1.0_2020-06-11T00:00:00Z_2020-11-22T00:00:00Z | Annapolis Basin | Cornwallis | 1.0 | 2020-06-11T00:00:00Z | 2020-11-22T00:00:00Z |
| Annapolis_Annapolis Basin_Cornwallis_1.0_2020-11-22T00:00:00Z_2021-06-16T00:00:00Z | Annapolis Basin | Cornwallis | 1.0 | 2020-11-22T00:00:00Z | 2021-06-16T00:00:00Z |
| Annapolis_Annapolis Basin_Cornwallis_2.0_2020-11-22T00:00:00Z_2021-06-16T00:00:00Z | Annapolis Basin | Cornwallis | 2.0 | 2020-11-22T00:00:00Z | 2021-06-16T00:00:00Z |
| Annapolis_Annapolis Basin_Lobster Ledge_6.0_2020-11-22T00:00:00Z_2021-06-16T00:00:00Z | Annapolis Basin | Lobster Ledge | 6.0 | 2020-11-22T00:00:00Z | 2021-06-16T00:00:00Z |
| Annapolis_Annapolis Basin_Lobster Ledge_4.0_2020-11-22T00:00:00Z_2021-06-16T00:00:00Z | Annapolis Basin | Lobster Ledge | 4.0 | 2020-11-22T00:00:00Z | 2021-06-16T00:00:00Z |
