### Compare Datasets From Different Delivery Methods

Written by Leila Belabbassi and Lori Garzio, Rutgers University

Purpose: demonstrate the automated tools used to compare science-variable values at matching timestamps across all delivery methods of a glider data stream. This example uses the PAR sensor on Pioneer Glider 335 (CP05MOAS-GL335-05-PARADM000).

In [1]:
import xarray as xr
import numpy as np
import pandas as pd
from datetime import timedelta
import functions.common as cf
from IPython.display import Image
from IPython.core.display import HTML

**Step 1: Get Data File List**  


- Load the list of data files covering all data delivery methods from the local CSV file created by the 2.0_data_review_list.ipynb notebook.

In [2]:
file_path = '/Users/leila/Documents/NSFEduSupport/github/8thEGOMeeting-notebooks/data_files_list_CP05MOAS-GL335-05-PARADM000.csv'

In [3]:
datafiles = pd.read_csv(file_path)
pd.set_option('display.max_colwidth', None)  # None shows full cell contents (-1 is deprecated)
datafiles

Unnamed: 0.1,Unnamed: 0,datasets,method
0,deployment0001,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-telemetered-parad_m_glider_instrument/deployment0001_CP05MOAS-GL335-05-PARADM000-telemetered-parad_m_glider_instrument_20141006T202152.905850-20141213T035750.235320.nc,telemetered
1,deployment0002,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-telemetered-parad_m_glider_instrument/deployment0002_CP05MOAS-GL335-05-PARADM000-telemetered-parad_m_glider_instrument_20151014T001900.237980-20151110T091836.233310.nc,telemetered
2,deployment0003,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-telemetered-parad_m_glider_instrument/deployment0003_CP05MOAS-GL335-05-PARADM000-telemetered-parad_m_glider_instrument_20160404T185705.311220-20160417T235956.145260.nc,telemetered
3,deployment0004,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-telemetered-parad_m_glider_instrument/deployment0004_CP05MOAS-GL335-05-PARADM000-telemetered-parad_m_glider_instrument_20160527T212312.351560-20160626T091401.747920.nc,telemetered
4,deployment0005,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-telemetered-parad_m_glider_instrument/deployment0005_CP05MOAS-GL335-05-PARADM000-telemetered-parad_m_glider_instrument_20170116T150223.595370-20170304T045334.799840.nc,telemetered
5,deployment0001,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0001_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20141006T202152.905850-20141213T073238.247380.nc,recovered_host
6,deployment0002,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0002_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20151014T001900.237980-20151110T091855.472810.nc,recovered_host
7,deployment0003,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0003_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20160404T185705.311220-20160417T235956.145260.nc,recovered_host
8,deployment0004,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0004_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20160529T204727.075500-20160626T091401.747920.nc,recovered_host
9,deployment0005,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0005_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20170116T150223.595370-20170304T093047.153350.nc,recovered_host
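Each dataset URL in the table above encodes the deployment, delivery method, and stream name. The following is a minimal sketch of how those pieces could be pulled apart with plain string splitting. The function name `parse_dataset_url` is hypothetical and not part of the notebook's `functions.common` module; the split positions assume the URL layout shown in the table.

```python
def parse_dataset_url(url):
    """Split a dataset URL (laid out as in the table above) into its parts.

    Hypothetical helper for illustration only.
    """
    directory, filename = url.split('/')[-2:]
    return {
        # the filename starts with 'deploymentNNNN_'
        'deployment': filename.split('_')[0],
        # the fifth hyphen-separated field of the filename is the delivery method
        'method': filename.split('-')[4],
        # the last hyphen-separated field of the parent directory is the stream
        'stream': directory.split('-')[-1],
    }


example_url = (
    'https://opendap.oceanobservatories.org/thredds/dodsC/ooi/'
    'lgarzio@marine.rutgers.edu/'
    '20190509T131304-CP05MOAS-GL335-05-PARADM000-telemetered-'
    'parad_m_glider_instrument/'
    'deployment0001_CP05MOAS-GL335-05-PARADM000-telemetered-'
    'parad_m_glider_instrument_20141006T202152.905850-'
    '20141213T035750.235320.nc'
)
parsed = parse_dataset_url(example_url)
print(parsed)
```

Positional splitting is brittle: it works here because method names such as `recovered_host` use underscores rather than hyphens, and it would need revisiting if the naming scheme changed.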


**Step 2: Get Dataset Review List**  
- Load the list of data files selected for review from the local CSV file created by the 2.0_data_review_list.ipynb notebook.

In [4]:
file_path = '/Users/leila/Documents/NSFEduSupport/github/8thEGOMeeting-notebooks/data_review_list_CP05MOAS-GL335-05-PARADM000.csv'

In [5]:
reviewfiles = pd.read_csv(file_path)
pd.set_option('display.max_colwidth', None)  # None shows full cell contents (-1 is deprecated)
reviewfiles

Unnamed: 0.1,Unnamed: 0,datasets,method
0,deployment0001,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0001_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20141006T202152.905850-20141213T073238.247380.nc,recovered_host
1,deployment0002,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0002_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20151014T001900.237980-20151110T091855.472810.nc,recovered_host
2,deployment0003,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0003_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20160404T185705.311220-20160417T235956.145260.nc,recovered_host
3,deployment0004,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0004_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20160529T204727.075500-20160626T091401.747920.nc,recovered_host
4,deployment0005,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0005_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20170116T150223.595370-20170304T093047.153350.nc,recovered_host


**Step 3: Get Dataset Comparison Status**  
- For each deployment, test whether data exist from another delivery method that can be compared with the review-list dataset.

In [6]:
deploy_list = pd.unique(datafiles['Unnamed: 0'])

In [7]:
d_info = pd.DataFrame()
d_dia = pd.DataFrame()
for item in deploy_list: 
    datalist = datafiles['datasets'][datafiles['Unnamed: 0'] == item]
    method = datafiles['method'][datafiles['Unnamed: 0'] == item]
    
    if len(datalist) == 1:
        comparison_status = 'No comparison: Data exist from only 1 delivery method.'
    elif 1 < len(datalist) <= 3:  # chained comparison; `&` binds tighter than `>` and `<=`
        comparison_status = 'Do comparison: Data exist from 2 different delivery methods.'
    else:
        comparison_status = 'No comparison: Data exist for more than 3 delivery methods. Please provide fewer datasets for analysis.'
    
    umethods = [u.split('/')[-1].split('-')[4] for u in datalist]
    ustreams = [u.split('/')[-2].split('-')[-1] for u in datalist]
           
    d0 = pd.DataFrame({'comparison_status': [comparison_status],
                       'methods': [umethods],
                       'streams':[ustreams], 
                       'datasets': [datalist.values]
                      }, index=[item])
    d_info = pd.concat([d_info, d0])  # DataFrame.append was removed in pandas 2.0
    
#     d_dia = d_dia.append(d0['note']) #, index=[d0.index.values]
pd.set_option('display.max_colwidth', None)  # None shows full cell contents (-1 is deprecated)
d_info[['comparison_status', 'methods', 'streams']]


Unnamed: 0,comparison_status,methods,streams
deployment0001,Do comparison: Data exist from 2 different delivery methods.,"[telemetered, recovered_host]","[parad_m_glider_instrument, parad_m_glider_recovered]"
deployment0002,Do comparison: Data exist from 2 different delivery methods.,"[telemetered, recovered_host]","[parad_m_glider_instrument, parad_m_glider_recovered]"
deployment0003,Do comparison: Data exist from 2 different delivery methods.,"[telemetered, recovered_host]","[parad_m_glider_instrument, parad_m_glider_recovered]"
deployment0004,Do comparison: Data exist from 2 different delivery methods.,"[telemetered, recovered_host]","[parad_m_glider_instrument, parad_m_glider_recovered]"
deployment0005,Do comparison: Data exist from 2 different delivery methods.,"[telemetered, recovered_host]","[parad_m_glider_instrument, parad_m_glider_recovered]"
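The status check above amounts to counting delivery methods per deployment. A minimal sketch of the same idea with `groupby` follows, assuming a tidy table with hypothetical column names `deployment` and `method` (the notebook's CSV uses `Unnamed: 0` for the deployment) and shortened status messages. It also illustrates why the range test is safer as a chained comparison: in Python, `&` binds more tightly than `>` and `<=`.

```python
import pandas as pd


def comparison_status(n_methods):
    """Map a count of delivery methods to a short status string."""
    if n_methods == 1:
        return 'No comparison: only 1 delivery method'
    if 1 < n_methods <= 3:  # chained comparison, no bitwise operator involved
        return 'Do comparison'
    return 'No comparison: more than 3 delivery methods'


# Illustrative input, not the real file list
files = pd.DataFrame({
    'deployment': ['deployment0001', 'deployment0001', 'deployment0002'],
    'method': ['telemetered', 'recovered_host', 'recovered_host'],
})

# Count the distinct delivery methods available for each deployment
n_methods = files.groupby('deployment')['method'].nunique()
status = n_methods.map(comparison_status)
print(status)

# Operator precedence pitfall: this parses as n > (1 & n) <= 3
n = 5
buggy = n > 1 & n <= 3
print(buggy, 1 < n <= 3)
```

With `n = 5`, the bitwise form evaluates to `True` even though 5 is outside the intended range, while the chained form gives `False`.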


**Step 4: Compare Datasets from Different Delivery Methods**  

Run the function to compare data values with matching timestamps for science variables among all delivery methods of CP05MOAS-GL335-05-PARADM000:

In [26]:
df =  pd.DataFrame()
df_sum =  pd.DataFrame()
for ii in range(len(d_info)): 
    index_i = d_info.index.values[ii]
    print(index_i)
    if len(d_info['methods'].values[ii]) == 2:
        print("comparing data values with matching timestamps")
        ds0 = xr.open_dataset(d_info['datasets'].values[ii][0])
        ds0 = ds0.swap_dims({'obs': 'time'})        
        ds0_sci_vars = cf.return_science_vars(d_info['streams'].values[ii][0])
        ds0_method = d_info['methods'].values[ii][0]
        
        ds1 = xr.open_dataset(d_info['datasets'].values[ii][1])
        ds1 = ds1.swap_dims({'obs': 'time'})        
        ds1_sci_vars = cf.return_science_vars(d_info['streams'].values[ii][1])
        ds1_method = d_info['methods'].values[ii][1]
        
        # define preferred method
        if reviewfiles['method'].values[ii] == ds0_method:
            preferred_method = 'ds0'
            preferred_stream_name = d_info['streams'].values[ii][0]
            
        elif reviewfiles['method'].values[ii] == ds1_method: 
            preferred_method = 'ds1'
            preferred_stream_name = d_info['streams'].values[ii][1]
         
        print('preferred_method: ', preferred_method, '(', preferred_stream_name, ')')
        # find where the variable long names are the same
        ds0names = cf.long_names(ds0, ds0_sci_vars)
        ds0names.rename(columns={'name': 'name_ds0'}, inplace=True)
        
        ds1names = cf.long_names(ds1, ds1_sci_vars)
        ds1names.rename(columns={'name': 'name_ds1'}, inplace=True)
        
        mapping = pd.merge(ds0names, ds1names, on='long_name', how='inner')
        print(mapping)
        df, missing_data_list, diff_gzero_list, var_list = cf.compare_datasets(df,
                                                        mapping, preferred_method,
                                              index_i, ds0, ds0_method, ds1, ds1_method)
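    else:
        # no pair of methods to compare: report why and skip the summary for this deployment
        print(d_info['comparison_status'].values[ii])
        continue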
    
    fd_test = cf.found_data_in_another_stream(missing_data_list)
 
    comparison_details = cf.found_data_in_another_stream_diff(fd_test, 
                                                              diff_gzero_list, var_list)
    
    df_sum0 = pd.DataFrame({'data_comparison': [fd_test],
                            'comparison_details': [comparison_details]}
                           , index= [d_info.index.values[ii]])
    df_sum = pd.concat([df_sum, df_sum0])  # DataFrame.append was removed in pandas 2.0

deployment0001
comparing data values with matching timestamps
         name_ds0                            long_name        name_ds1
0  sci_bsipar_par  Photosynthetically Active Radiation  sci_bsipar_par
1  sci_bsipar_par  Photosynthetically Active Radiation  parad_m_par   
2  parad_m_par     Photosynthetically Active Radiation  sci_bsipar_par
3  parad_m_par     Photosynthetically Active Radiation  parad_m_par   
deployment0002
comparing data values with matching timestamps
         name_ds0                            long_name        name_ds1
0  sci_bsipar_par  Photosynthetically Active Radiation  sci_bsipar_par
1  sci_bsipar_par  Photosynthetically Active Radiation  parad_m_par   
2  parad_m_par     Photosynthetically Active Radiation  sci_bsipar_par
3  parad_m_par     Photosynthetically Active Radiation  parad_m_par   
deployment0003
comparing data values with matching timestamps
         name_ds0                            long_name        name_ds1
0  sci_bsipar_par  Photosynthetic
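The mapping printed for each deployment has four rows because both variables in each dataset share the same long name, so an inner merge on `long_name` pairs every variable in one dataset with every variable in the other. A small sketch reproducing that behavior is shown below. The final filter on identical variable names is an optional assumption added for illustration and is not something the notebook does.

```python
import pandas as pd

long_name = 'Photosynthetically Active Radiation'

# Both datasets expose two variables with the same long name
ds0names = pd.DataFrame({'name_ds0': ['sci_bsipar_par', 'parad_m_par'],
                         'long_name': [long_name, long_name]})
ds1names = pd.DataFrame({'name_ds1': ['sci_bsipar_par', 'parad_m_par'],
                         'long_name': [long_name, long_name]})

# An inner merge on a duplicated key yields every pairing: 2 x 2 = 4 rows
mapping = pd.merge(ds0names, ds1names, on='long_name', how='inner')
print(mapping)

# Optional (assumption): keep only pairs whose variable names also match
same_name = mapping[mapping['name_ds0'] == mapping['name_ds1']]
print(same_name)
```

The cross pairings (for example `sci_bsipar_par` against `parad_m_par`) compare two different variables, which is worth keeping in mind when reading the difference statistics for those rows.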

In [9]:
# df = df[df['unit_test'] != 'fail']
# df = df[df['unit'] != 'bar']

In [30]:
pd.set_option('display.max_colwidth', None)  # None shows full cell contents (-1 is deprecated)
df

Unnamed: 0,stream,parameter,unit,unit_test,n,n_nan,missing_data,n_comparison,min_abs_diff,max_abs_diff,n_diff_greater_zero,percent_diff_greater_zero
deployment0001,parad_m_glider_recovered,sci_bsipar_par,µmol photons m-2 s-1,pass,3730926,0,no missing data,68916,0.0,7139.0,1.0,0.0
deployment0001,parad_m_glider_recovered,parad_m_par,µmol photons m-2 s-1,pass,3730926,0,no missing data,68916,0.0,4485760000.0,67574.0,98.05
deployment0001,parad_m_glider_recovered,sci_bsipar_par,µmol photons m-2 s-1,pass,3730926,0,no missing data,68916,0.0,4485760000.0,67574.0,98.05
deployment0001,parad_m_glider_recovered,parad_m_par,µmol photons m-2 s-1,pass,3730926,0,no missing data,68916,0.0,0.007139,0.0,0.0
deployment0002,parad_m_glider_recovered,sci_bsipar_par,µmol photons m-2 s-1,pass,1116293,0,no missing data,23133,0.0,0.0,0.0,0.0
deployment0002,parad_m_glider_recovered,parad_m_par,µmol photons m-2 s-1,pass,1116293,0,no missing data,23133,0.0,5014060000.0,21698.0,93.8
deployment0002,parad_m_glider_recovered,sci_bsipar_par,µmol photons m-2 s-1,pass,1116293,0,no missing data,23133,0.0,5014060000.0,21698.0,93.8
deployment0002,parad_m_glider_recovered,parad_m_par,µmol photons m-2 s-1,pass,1116293,0,no missing data,23133,0.0,0.0,0.0,0.0
deployment0003,parad_m_glider_recovered,sci_bsipar_par,µmol photons m-2 s-1,pass,483042,0,no missing data,11185,0.0,0.40871,0.0,0.0
deployment0003,parad_m_glider_recovered,parad_m_par,µmol photons m-2 s-1,pass,483042,0,no missing data,11185,0.0,4981.47,3196.0,28.57
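The difference columns in the table can be read as statistics over the timestamps that both delivery methods share. The sketch below shows one way such statistics could be computed with pandas. It is an assumption about the approach, not the actual implementation of `cf.compare_datasets`; the function name `compare_matching_timestamps` and the sample values are invented for illustration. The percentages in the table are consistent with `100 * n_diff_greater_zero / n_comparison` rounded to two decimals (for example, 67574 / 68916 gives 98.05).

```python
import numpy as np
import pandas as pd


def compare_matching_timestamps(preferred, other):
    """Summarize absolute differences at timestamps present in both series.

    Hypothetical helper; both inputs are pandas Series indexed by time.
    """
    # An inner join keeps only timestamps found in both series
    joined = pd.concat(
        [preferred.rename('preferred'), other.rename('other')],
        axis=1, join='inner',
    ).dropna()
    abs_diff = (joined['preferred'] - joined['other']).abs()
    n = len(abs_diff)
    n_diff = int((abs_diff > 0).sum())
    return {
        'n_comparison': n,
        'min_abs_diff': float(abs_diff.min()) if n else np.nan,
        'max_abs_diff': float(abs_diff.max()) if n else np.nan,
        'n_diff_greater_zero': n_diff,
        'percent_diff_greater_zero': round(100 * n_diff / n, 2) if n else np.nan,
    }


# Invented sample values, not real glider data
times = pd.to_datetime(['2016-05-30 00:00', '2016-05-30 01:00',
                        '2016-05-30 02:00', '2016-05-30 03:00'])
recovered = pd.Series([1.0, 2.0, 3.0, 4.0], index=times)
telemetered = pd.Series([1.0, 2.5, 3.0], index=times[[0, 1, 3]])

summary = compare_matching_timestamps(recovered, telemetered)
print(summary)
```

Only three timestamps are shared in this sample, so the 02:00 value in the recovered series does not enter the comparison.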


In [11]:
pd.set_option('display.max_colwidth', None)  # None shows full cell contents (-1 is deprecated)
pd.DataFrame(df_sum['data_comparison']) 

Unnamed: 0,data_comparison
deployment0001,pass
deployment0002,pass
deployment0003,pass
deployment0004,fail: data found in another stream (gaps: [1] days: [3])
deployment0005,pass


The summary table above shows 3 days during deployment 4 on which data are found in a data stream that is not the preferred stream. The details from the previous table show that, for each parameter, 2457 data points found in the telemetered stream between 2016-05-27 and 2016-05-29 are not found in the recovered_host data stream (see Image 1).

For further details, visit the data review report page: https://datareview.marine.rutgers.edu/instruments/report/CP05MOAS-GL335-05-PARADM000

**Image 1:**

In [12]:
Image(url="https://marine.rutgers.edu/cool/ooi/data-eval/data_review/CP/CP05MOAS/CP05MOAS-GL335/CP05MOAS-GL335-05-PARADM000/method_compare_plots/deployment0004-recovered_host-parad_m_glider_recovered%20telemetered-parad_m_glider_instrument-2016-05-20to2016-06-01/deployment0004_CP05MOAS-GL335-05-PARADM000_Photosynthetically%20Active%20Radiation.png")
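A failure of the kind reported for deployment 4 means that some timestamps exist in the non-preferred stream but not in the preferred one. The sketch below shows one way such timestamps and the calendar days they fall on could be identified with pandas index operations. It is not the implementation of `cf.found_data_in_another_stream`; the function name `days_missing_from_preferred` and all timestamps are invented for illustration.

```python
import pandas as pd


def days_missing_from_preferred(preferred_times, other_times):
    """Return the calendar days on which `other_times` has timestamps
    that are absent from `preferred_times`. Hypothetical helper."""
    # Timestamps present in the other stream only
    only_in_other = other_times.difference(preferred_times)
    # Truncate each timestamp to midnight and keep the distinct days
    return only_in_other.normalize().unique()


# Invented timestamps, not real glider data
preferred_times = pd.to_datetime(['2016-05-29 20:47', '2016-05-30 00:00'])
other_times = pd.to_datetime(['2016-05-27 21:23', '2016-05-28 12:00',
                              '2016-05-29 01:00', '2016-05-29 20:47',
                              '2016-05-30 00:00'])

missing_days = days_missing_from_preferred(preferred_times, other_times)
print(len(missing_days), list(missing_days.strftime('%Y-%m-%d')))
```

In this sample, three distinct days contain timestamps that appear only in the second stream.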

**END**
