<p style="font-size:1.4em;">Data Coverage, Time Order, and Gap Identification Tests</p>

In [1]:
# functions needed to run the notebook
import pandas as pd
import functions.common as cf
import functions.plotting as pf
import xarray as xr
from datetime import timedelta
import numpy as np
import datetime as dt
import netCDF4 as nc
from termcolor import colored

**1**
- **Get Datasets Review List**  
- method: get the list of data files for review from a local file created by the 2.0_data_review_list.ipynb notebook

In [3]:
reviewlist = pd.read_csv('data_review_list_CP05MOAS-GL335-05-PARADM000.csv')
reviewlist.index = reviewlist['Unnamed: 0'].values
pd.set_option('display.max_colwidth', None)  # None (pandas >= 1.0) shows full column contents
pd.DataFrame(reviewlist)[['datasets']]

deployment,datasets
deployment0001,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0001_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20141006T202152.905850-20141213T073238.247380.nc
deployment0002,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0002_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20151014T001900.237980-20151110T091855.472810.nc
deployment0003,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0003_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20160404T185705.311220-20160417T235956.145260.nc
deployment0004,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0004_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20160529T204727.075500-20160626T091401.747920.nc
deployment0005,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0005_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20170116T150223.595370-20170304T093047.153350.nc


**2**
- **Get Deployment Information**
- method: uses the *refdes_datareview_json* function in the common.py file under the local **/functions** directory

In [4]:
col = list(reviewlist.columns)
refdes = reviewlist[col[1]][0].split('/')[-1].split('_')[1][0:27]
dr_data = cf.refdes_datareview_json(refdes)

tf = pd.DataFrame(dr_data['instrument']['deployments'])[['deployment_number','start_date','stop_date']]
tf.index = tf['deployment_number'].values
pd.set_option('display.max_colwidth', None)
tf[['start_date','stop_date']]

deployment_number,start_date,stop_date
1,2014-10-06T20:16:00+00:00,2014-12-15T00:00:00+00:00
2,2015-10-13T01:12:14+00:00,2015-11-16T00:00:00+00:00
3,2016-04-04T18:57:02+00:00,2016-04-18T00:00:00+00:00
4,2016-05-27T20:33:00+00:00,2016-06-27T00:00:00+00:00
5,2017-01-16T14:59:00+00:00,2017-03-06T22:45:00+00:00
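
The string slicing used in this notebook to pull the deployment number, reference designator, and stream name out of a dataset URL can be sketched as one small function. This is a minimal illustration only: `parse_dataset_url` is a hypothetical helper (it is not part of `functions/common.py`), and it assumes the OOI file-naming pattern seen in the review list above, with a 27-character reference designator.

```python
# Hypothetical helper (not in functions/common.py): split a dataset URL into
# the pieces the notebook extracts inline with split() and slicing.
def parse_dataset_url(url):
    filename = url.split('/')[-1]    # e.g. deployment0003_CP05MOAS-..._<start>-<end>.nc
    directory = url.split('/')[-2]   # e.g. <request time>-<refdes>-<method>-<stream>
    # Assumption: the file name starts with 'deployment' followed by a zero-padded number
    deployment = int(filename.split('_')[0].replace('deployment', ''))
    # Assumption: the reference designator is always 27 characters long
    refdes = filename.split('_')[1][0:27]
    stream = directory.split('-')[-1]
    return deployment, refdes, stream


url = ('https://opendap.oceanobservatories.org/thredds/dodsC/ooi/'
       'lgarzio@marine.rutgers.edu/'
       '20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/'
       'deployment0003_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_'
       '20160404T185705.311220-20160417T235956.145260.nc')
deployment, refdes, stream = parse_dataset_url(url)
print(deployment, refdes, stream)
```

Keeping the parsing in one place makes the naming assumptions explicit; a URL that does not follow this pattern would need different handling.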


**3**
- **Get Annotations**
- method: uses the output of the previous function *refdes_datareview_json*

In [9]:
tf = pd.DataFrame(dr_data['instrument']['annotations'])[['reference_designator','annotation','end_datetime','start_datetime']]
tf.index = tf['reference_designator'].values
pd.set_option('display.max_colwidth', None)
tf[['annotation','end_datetime','start_datetime']]

reference_designator,annotation,end_datetime,start_datetime
CP05MOAS-GL335,No data expected because of a leak identified in the bellophraham.,2015-11-15T19:00:00+00:00,2015-11-10T04:00:00+00:00
CP05MOAS-GL335,No data expected because of a leak caused by a ship strike.,2017-03-03T23:00:00+00:00,2017-03-03T22:00:00+00:00


**4**
- **Time Order in Data Files**
- <p style="color:red;">Pass/Fail: </p> Test if timestamps in the file are unique and in ascending order.


In [10]:
df = pd.DataFrame()
for ii in range(len(reviewlist)):
    deploy_num = int(reviewlist[col[0]][ii].split('t')[-1])
    method = reviewlist[col[2]][ii]
    stream = reviewlist[col[1]][ii].split('/')[-2].split('-')[-1]
    # Get time array
    ds = xr.open_dataset(reviewlist[col[1]][ii], mask_and_scale=False)
    ds = ds.swap_dims({'obs': 'time'})
    time = ds['time']
    
    # Check that the timestamps in the file are unique
    len_time = time.__len__()
    len_time_unique = np.unique(time).__len__()
    if len_time == len_time_unique:
        time_unique = 'pass'
    else:
        time_unique = 'fail'
        
    # Check that the timestamps in the file are in ascending order
    time_in = [dt.datetime.utcfromtimestamp(np.datetime64(x).astype('O')/1e9) for x in time.values]
    time_data = nc.date2num(time_in, 'seconds since 1900-01-01')

    # Create a True/False list: True where a timestamp is later than the one before it
    result = [(time_data[k + 1] - time_data[k]) > 0 for k in range(len(time_data) - 1)]

    # List indices where time is not increasing
    if result.count(True) == len(time) - 1:
        time_ascending = 'pass'
    else:
        # use `not v` rather than `v is False`: v is a numpy bool, which is never identical to Python's False
        ind_fail = {k: time_in[k] for k, v in enumerate(result) if not v}
        time_ascending = 'fail: {}'.format(ind_fail)
        
    df0 = pd.DataFrame({'Delivery Method': [method],
                        'Data Stream': [stream],
                        'Unique Test': [time_unique],
                        'Ascending Test': [time_ascending]                                      
                        }, index=[deploy_num])

    df = pd.concat([df, df0])  # DataFrame.append was removed in pandas 2.0

pd.set_option('display.max_colwidth', None)
df

Deployment,Delivery Method,Data Stream,Unique Test,Ascending Test
1,recovered_host,parad_m_glider_recovered,pass,pass
2,recovered_host,parad_m_glider_recovered,pass,pass
3,recovered_host,parad_m_glider_recovered,pass,pass
4,recovered_host,parad_m_glider_recovered,pass,pass
5,recovered_host,parad_m_glider_recovered,pass,pass
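
The two checks in this test can also be sketched in vectorized form with NumPy, which avoids converting millions of timestamps to Python `datetime` objects. This is a minimal sketch under stated assumptions, not the notebook's method: `check_time_order` is a hypothetical name, and the input is assumed to be convertible to `datetime64[ns]`.

```python
import numpy as np


def check_time_order(time_values):
    """Return ('pass'|'fail', 'pass'|'fail: {...}') for uniqueness and ascending order."""
    t = np.asarray(time_values, dtype='datetime64[ns]')

    # Unique test: the number of distinct timestamps equals the number of timestamps
    unique_test = 'pass' if len(np.unique(t)) == len(t) else 'fail'

    # Ascending test: every step between consecutive timestamps must be > 0
    steps = np.diff(t)
    bad = np.flatnonzero(steps <= np.timedelta64(0, 'ns'))
    if bad.size == 0:
        ascending_test = 'pass'
    else:
        # Report index k and timestamp t[k] wherever t[k + 1] <= t[k]
        ascending_test = 'fail: {}'.format({int(k): str(t[k]) for k in bad})
    return unique_test, ascending_test


ordered = ['2014-10-06T20:21:52', '2014-10-06T20:21:53', '2014-10-06T20:21:54']
swapped = ['2014-10-06T20:21:52', '2014-10-06T20:21:54', '2014-10-06T20:21:53']
print(check_time_order(ordered))
print(check_time_order(swapped))
```

A repeated timestamp fails both tests here, because a step of zero is not strictly increasing.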


**5**
- **Files Data Coverage:**
<p style="color:blue;">$$\text{Data Coverage} = \frac{\text{File Days}}{\text{Deployment Days}} \times 100\,\% $$ </p> 


<p style="color:green;">Deployment Days</p>Number of days the instrument was deployed.
<p style="color:green;">File Days</p>Number of days for which there is at least 1 timestamp available for the instrument.
<p style="color:green;">Start Gap</p>Number of missing days at the start of a deployment: comparison of the deployment start date to the data start date.
<p style="color:green;">End Gap</p>Number of missing days at the end of a deployment: comparison of the deployment end date to the data end date.
<p style="color:green;">Timestamps</p>Number of timestamps in a data file.
<p style="color:green;">Sampling Rate</p>
Sampling rates are calculated from the differences between consecutive timestamps. The most common sampling rate is the one that accounts for more than 50% of those differences.
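
The sampling-rate definition can be illustrated on a small synthetic series. This is a minimal sketch, not the notebook's implementation: `dominant_sampling_rate` and its `threshold` parameter are hypothetical names, the timestamps are made up, and differences are rounded to the nearest second.

```python
import pandas as pd


def dominant_sampling_rate(time_values, threshold=0.5):
    """Return the rate in seconds that covers more than `threshold` of all differences, else None."""
    times = pd.Series(pd.to_datetime(time_values))
    # Differences between consecutive timestamps, rounded to whole seconds
    diffs = times.diff().dropna().dt.total_seconds().round().astype(int)
    # Fraction of differences at each rate, sorted from most to least frequent
    shares = diffs.value_counts(normalize=True)
    if len(shares) and shares.iloc[0] > threshold:
        return int(shares.index[0])
    return None


# Synthetic example: four 1-second steps followed by one 10-second step
start = pd.Timestamp('2017-01-16T15:02:23')
mostly_1s = start + pd.to_timedelta([0, 1, 2, 3, 4, 14], unit='s')
# Synthetic example: steps of 1, 2, and 3 seconds, so no rate exceeds 50%
mixed = start + pd.to_timedelta([0, 1, 3, 6], unit='s')

rate_regular = dominant_sampling_rate(mostly_1s)
rate_mixed = dominant_sampling_rate(mixed)
print(rate_regular, rate_mixed)
```

In the first series the 1-second step makes up 4 of 5 differences (80%), so it is reported; in the second no single rate exceeds 50%, so the function returns `None`.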

In [6]:
df = pd.DataFrame()
for ii in range(len(reviewlist)):         
    deploy_num = int(reviewlist[col[0]][ii].split('t')[-1])
    method = reviewlist[col[2]][ii]
    stream = reviewlist[col[1]][ii].split('/')[-2].split('-')[-1]
    deploy_info = cf.get_deployment_information(dr_data, deploy_num)
    deploy_depth = deploy_info['deployment_depth']
    
    # Calculate days deployed
    deploy_start = str(deploy_info['start_date'])
    deploy_stop = str(deploy_info['stop_date']) 
#     print('{}{} - {}{}'.format('Data Start Date: ', deploy_start,'Data End Date: ',deploy_stop))
    if deploy_stop != 'None':
        r_deploy_start = pd.to_datetime(deploy_start).replace(hour=0, minute=0, second=0)
        if deploy_stop.split('T')[1] == '00:00:00':
            r_deploy_stop = pd.to_datetime(deploy_stop)
        else:
            r_deploy_stop = (pd.to_datetime(deploy_stop) + timedelta(days=1)).replace(hour=0, minute=0, second=0)
        n_days_deployed = (r_deploy_stop - r_deploy_start).days
    else:
        n_days_deployed = None
    
    # Get time array
    ds = xr.open_dataset(reviewlist[col[1]][ii], mask_and_scale=False)
    ds = ds.swap_dims({'obs': 'time'})
    time = ds['time']
    
    # Check that the timestamps in the file are unique
    len_time = time.__len__()
    len_time_unique = np.unique(time).__len__()
    
    # calculate gaps size at start of deployment    
    start_gap = (pd.to_datetime(str(time.values[0])) - r_deploy_start).days
   
    # calculate gap size at end of deployment
    end_gap = (r_deploy_stop - pd.to_datetime(str(time.values[-1]))).days    
    
    # Count the number of days for which there is at least 1 timestamp    
    n_days = len(np.unique(time.values.astype('datetime64[D]')))
    time_df = pd.DataFrame(time.values, columns=['time'])
    
    # Calculate the sampling rate to the nearest second
    time_df['diff'] = time_df['time'].diff().astype('timedelta64[s]')
    rates_df = time_df.groupby(['diff']).agg(['count'])
    n_diff_calc = len(time_df) - 1
    rates = dict(n_unique_rates=len(rates_df), common_sampling_rates=dict())
    for i, row in rates_df.iterrows():
        percent = (float(row['time']['count']) / float(n_diff_calc))
        if percent > 0.1:
            rates['common_sampling_rates'].update({int(i): '{:.2%}'.format(percent)})
    sampling_rt_sec = None
    for k, v in rates['common_sampling_rates'].items():
        if float(v.strip('%')) > 50.00:
            sampling_rt_sec = k

    if not sampling_rt_sec:
        sampling_rt_sec = 'no consistent sampling rate: {}'.format(rates['common_sampling_rates']) 
        
    df0 = pd.DataFrame({
                        'Delivery Method': [method],    
                        'Data Stream': [stream],
                        'Deployment Days': [n_days_deployed],
                        'File Days': [n_days], 
                        'Timestamps': [len_time],
                        'sampling Rate (s)': [sampling_rt_sec],
                        'Start Gap': [start_gap],
                        'End Gap': [end_gap],
                        'Data Coverage (%)': [round((n_days*100)/n_days_deployed)]        
                        }, index=[deploy_num])

    df = pd.concat([df, df0])  # DataFrame.append was removed in pandas 2.0
pd.set_option('display.max_colwidth', None)
df

Deployment,Delivery Method,Data Stream,Deployment Days,File Days,Timestamps,sampling Rate (s),Start Gap,End Gap,Data Coverage (%)
1,recovered_host,parad_m_glider_recovered,71,69,3730926,1,0,2,97
2,recovered_host,parad_m_glider_recovered,35,28,1116293,1,1,6,80
3,recovered_host,parad_m_glider_recovered,15,14,483042,1,0,1,93
4,recovered_host,parad_m_glider_recovered,32,29,1536288,1,2,1,91
5,recovered_host,parad_m_glider_recovered,50,48,3352550,1,0,2,96


**Note**
<p style="color:red;">There are no annotations in the system that explain the gaps at the start and end of deployments.</p>
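
As a worked check of the definitions in this section, the sketch below recomputes the day counts for deployment 5 from its deployment dates and the data start and end times encoded in its file name. It is an illustration under stated assumptions: `coverage_summary` is a hypothetical helper, the hourly time series is synthetic (the real file has about 3.35 million timestamps), timestamps are treated as timezone-naive, and a stop time exactly at midnight is assumed to already sit on a day boundary.

```python
import numpy as np
import pandas as pd


def coverage_summary(deploy_start, deploy_stop, time_values):
    """Compute deployment days, file days, start/end gaps, and coverage (%)."""
    # Round the deployment start down to midnight
    start = pd.Timestamp(deploy_start).normalize()
    # Round the deployment stop up to the next midnight unless it is already at midnight
    stop = pd.Timestamp(deploy_stop)
    if stop != stop.normalize():
        stop = stop.normalize() + pd.Timedelta(days=1)

    times = pd.DatetimeIndex(time_values)
    deployment_days = (stop - start).days
    # Number of calendar days with at least one timestamp
    file_days = len(np.unique(times.values.astype('datetime64[D]')))
    return {
        'Deployment Days': deployment_days,
        'File Days': file_days,
        'Start Gap': (times[0] - start).days,
        'End Gap': (stop - times[-1]).days,
        'Data Coverage (%)': round(file_days * 100 / deployment_days),
    }


# Synthetic hourly series spanning the data start and end times of deployment 5
times = pd.date_range('2017-01-16T15:02:23', '2017-03-04T09:30:47',
                      freq=pd.Timedelta(hours=1))
summary = coverage_summary('2017-01-16T14:59:00', '2017-03-06T22:45:00', times)
print(summary)
```

With these inputs the sketch gives 50 deployment days, 48 file days, a start gap of 0, an end gap of 2, and 96% coverage, matching the deployment 5 row of the table.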

**6**
- **Identify Gaps in Data Files:**
<p style="color:green;">End Gap</p>
Number of missing days at the end of a deployment: comparison of the deployment end date to the data end date.
<p style="color:green;">Gaps Count</p>
Number of gaps within a data file (exclusive of missing data at the beginning and end of a deployment). A gap is defined as more than 1 day of missing data.
<p style="color:green;">Gap Days</p>
Number of days of missing data within a data file (exclusive of missing data at the beginning and end of a deployment).

In [7]:
df = pd.DataFrame()
for ii in range(len(reviewlist)):
    deploy_num = int(reviewlist[col[0]][ii].split('t')[-1])
    method = reviewlist[col[2]][ii]
    stream = reviewlist[col[1]][ii].split('/')[-2].split('-')[-1]
    # Get time array
    ds = xr.open_dataset(reviewlist[col[1]][ii], mask_and_scale=False)
    ds = ds.swap_dims({'obs': 'time'})
    time = ds['time']
    
    # Get a list of data gaps >1 day    
    time_df = pd.DataFrame(time.values, columns=['time'])
    gap_list = cf.timestamp_gap_test(time_df)
    df0 = pd.DataFrame({
                        'Delivery Method': [method], 
                        'Data Stream': [stream],
                        'Gap List': [gap_list],
                        'Gap Days': [int(len(gap_list))],
                        }, index=[deploy_num])

    df = pd.concat([df, df0])  # DataFrame.append was removed in pandas 2.0
pd.set_option('display.max_colwidth', None)
df

Deployment,Delivery Method,Data Stream,Gap List,Gap Days
1,recovered_host,parad_m_glider_recovered,[],0
2,recovered_host,parad_m_glider_recovered,[],0
3,recovered_host,parad_m_glider_recovered,[],0
4,recovered_host,parad_m_glider_recovered,[],0
5,recovered_host,parad_m_glider_recovered,[],0


<p style="color:green;">Note</p>

No data gaps longer than one day were identified.

**Summary of Results**

In [5]:
df = cf.time_gap_test(reviewlist, col, dr_data)
pd.set_option('display.max_colwidth', None)
print(colored('Instrument', 'green'), colored(refdes, 'blue'))
(df)

Instrument CP05MOAS-GL335-05-PARADM000


Deployment,Delivery Method,Data Stream,Deployment Days,File Days,Timestamps,Start Gap,End Gap,Gap List,Gap Days,Sampling Rate(s),Time Order,Data Coverage(%)
1,recovered_host,parad_m_glider_recovered,71,69,3730926,0,2,[],0,1,"[Unique: pass, Ascending: pass]",97
2,recovered_host,parad_m_glider_recovered,35,28,1116293,1,6,[],0,1,"[Unique: pass, Ascending: pass]",80
3,recovered_host,parad_m_glider_recovered,15,14,483042,0,1,[],0,1,"[Unique: pass, Ascending: pass]",93
4,recovered_host,parad_m_glider_recovered,32,29,1536288,2,1,[],0,1,"[Unique: pass, Ascending: pass]",91
5,recovered_host,parad_m_glider_recovered,50,48,3352550,0,2,[],0,1,"[Unique: pass, Ascending: pass]",96


**Notes**

- <p style="color:green;">Time Order:</p> 
    - The timestamps in the data files are unique and in ascending order.
    
- <p style="color:green;">Data Coverage:</p>
    - Data coverage is good, ranging from 80% to 97%.
    
- <p style="color:green;">Data Gaps:</p>
    - The data files are gap free, except for the gaps at the start and end of deployments.
    - The identified gaps are not annotated in the system.


**End**
- Link to the instrument report page:
https://datareview.marine.rutgers.edu/instruments/report/CP05MOAS-GL335-05-PARADM000