## Time Coverage Review Process.

1. This notebook will review the time array in the data file and compare it to the start and end date/time information in the deployment file.  The start and end date/time correspond to the time the sensor was put in the water and the time it was removed from the water, respectively.


2. The tests included in this notebook are described below:


- Timestamps Order.

<blockquote> Time Order: Test that timestamps in the file are unique and in ascending order.</blockquote>



- Sampling Rate Consistency.

<blockquote> Sampling Rate: Sampling rates are calculated from the differences in timestamps. The most common sampling rate is the one that accounts for more than 50% of the timestamp differences.</blockquote>



- Start/End Time Gap.

<blockquote> Start Gap:	Number of missing days at the start of a deployment: comparison of the deployment start date to the data start date.</blockquote>


<blockquote>End Gap:	Number of missing days at the end of a deployment: comparison of the deployment end date to the data end date.</blockquote>


- Gap Identification.

<blockquote> 
Gap Count: Number of gaps within a data file (exclusive of missing data at the beginning and end of a deployment). A gap is defined as more than 1 day of missing data.</blockquote>

<blockquote>Gap Days:	Number of days of missing data within a data file (exclusive of missing data at the beginning and end of a deployment).</blockquote>

### Outline.
-	[Python Packages.](#1)
-	[Load Data File.](#2)
-	[Extract Time Array.](#3)
-   [Timestamps Order.](#4)
    -	[Check if Timestamps are Unique.](#41)
    -	[Check if Timestamps are in Ascending Order.](#42)
-   [Sampling Rate Consistency.](#8)
-   [Start/End Time Gap.](#5)
    -	[Check the Start/End Time.](#51)
-   [Gap Identification.](#6)
    -	[Identify Gaps in the Time Array.](#61)
-	[Time Coverage Evaluation Summary.](#7)

<a id="1"></a>
### Python Packages.

In [2]:
import xarray as xr
import pandas as pd
import numpy as np
import datetime as dt
import netCDF4 as nc
from datetime import timedelta

<a id="2"></a>
### Load Data File.

In [40]:
# This is how you change your directory to where your data file is stored:
# Using Mac:
%cd '/Users/leilabelabassi/Desktop/TAMU/online-class/612-DataQuality4theGeosciences/class_material/Module3_DataFiles_telemetered-GP03FLMB-RIM01-02-CTDMOG060/'

# Using Windows: 
#%cd H:\test  (Line commented out (#) so it is not executed)

# List the files in the current directory 
%ls

/Users/leilabelabassi/Desktop/TAMU/online-class/612-DataQuality4theGeosciences/class_material/Module3_DataFiles_telemetered-GP03FLMB-RIM01-02-CTDMOG060
deployment0001_GP03FLMB-RIM01-02-CTDMOG060-telemetered-ctdmo_ghqr_sio_mule_instrument_20130724T100001-20140227T140001.nc
deployment0002_GP03FLMB-RIM01-02-CTDMOG060-telemetered-ctdmo_ghqr_sio_mule_instrument_20140620T040001-20141109T000001.nc
deployment0003_GP03FLMB-RIM01-02-CTDMOG060-telemetered-ctdmo_ghqr_sio_mule_instrument_20150609T000001-20160209T220001.nc
deployment0004_GP03FLMB-RIM01-02-CTDMOG060-telemetered-ctdmo_ghqr_sio_mule_instrument_20161008T080001-20161219T000001.nc
deployment0007_GP03FLMB-RIM01-02-CTDMOG060-telemetered-ctdmo_ghqr_sio_mule_instrument_20190928T000001-20200118T200001.nc


In [8]:
filename = 'deployment0004_GP03FLMB-RIM01-02-CTDMOG060-telemetered-ctdmo_ghqr_sio_mule_instrument_20161008T080001-20161219T000001.nc'

# Load data
file_content = xr.open_dataset(filename,mask_and_scale=False) 
file_content

<a id="3"></a>
### Extract Time Array.

In [42]:
time = file_content['time']
time

<a id="4"></a>
### Timestamps Order.
<a id="41"></a>
#### Check If Timestamps are Unique.
1. Get the number of timestamps in a data file.
2. Get the number of unique timestamps.
3. Compare the number of timestamps to the number of unique timestamps.

In [11]:
# (1)
# Get the number of timestamps in a data file.
len_time = len(time)
len_time

108

In [12]:
# (2)
# Get the number of unique timestamps.
len_time_unique = len(np.unique(time))
len_time_unique

108

In [37]:
# (3)
# Compare the lengths
if len_time == len_time_unique:
    time_unique = 'pass'
else:
    time_unique = 'fail'

time_unique

'pass'
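
The count comparison above shows whether duplicates exist, but not which timestamps repeat. The following is a minimal sketch of one way to list them with pandas. The array `sample_times` is made up for illustration and is not taken from the data file.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps for illustration only; the second value repeats.
sample_times = np.array(
    ['2016-10-08T08:00:01', '2016-10-08T12:00:01',
     '2016-10-08T12:00:01', '2016-10-08T16:00:01'],
    dtype='datetime64[ns]')

sample_index = pd.DatetimeIndex(sample_times)

# is_unique is True only when no timestamp occurs more than once.
unique_test = 'pass' if sample_index.is_unique else 'fail'

# duplicated(keep=False) flags every occurrence of a repeated timestamp;
# unique() then reduces the flagged values to one entry per timestamp.
repeated = sample_index[sample_index.duplicated(keep=False)].unique()

print(unique_test)
print(list(repeated))
```

In the notebook, the same calls could be applied to `time.values` in place of `sample_times`.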

<a id="42"></a>
#### Check if Timestamps are in Ascending Order.
1. Convert Time.
2. Create a True/False List for Every Pair of Consecutive Timestamps.
3. List Indices Where Time is not Increasing.

In [43]:
# (1)
# Convert each numpy datetime64 value to a Python datetime
# (dt.datetime.utcfromtimestamp is deprecated as of Python 3.12).
time_in = [pd.Timestamp(x).to_pydatetime() for x in time.values]

time_data = nc.date2num(time_in, 'seconds since 1900-01-01')
print(pd.DataFrame(time_data)) # print using pandas dataframe

2016-10-08 08:00:01
                0
0    3.684902e+09
1    3.684917e+09
2    3.685824e+09
3    3.685838e+09
4    3.685853e+09
..            ...
103  3.690907e+09
104  3.691022e+09
105  3.691037e+09
106  3.691051e+09
107  3.691094e+09

[108 rows x 1 columns]
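
The conversion above expresses each timestamp as seconds since 1900-01-01. As a rough sketch, the same numbers can be obtained with NumPy arithmetic alone, without netCDF4. The array `sample_times` and the names `epoch` and `seconds_since_1900` are assumptions introduced here; for the standard calendar the result should agree with `nc.date2num`.

```python
import numpy as np

# Hypothetical timestamps for illustration only.
sample_times = np.array(
    ['2016-10-08T08:00:01', '2016-10-08T12:00:01'],
    dtype='datetime64[ns]')

# Subtracting the reference epoch gives timedelta64 values; dividing by
# one second converts them to floating-point seconds.
epoch = np.datetime64('1900-01-01T00:00:00')
seconds_since_1900 = (sample_times - epoch) / np.timedelta64(1, 's')

print(seconds_since_1900)
```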


In [16]:
# (2) Create a True/False list: True where a timestamp is later than the one before it
result = [(time_data[k + 1] - time_data[k]) > 0 for k in range(len(time_data) - 1)]

print(pd.DataFrame(result))


        0
0    True
1    True
2    True
3    True
4    True
..    ...
102  True
103  True
104  True
105  True
106  True

[107 rows x 1 columns]


In [17]:
# (3) 
# List the indices where time is not increasing
if all(result):
    time_ascending = 'pass'
else:
    # result holds NumPy booleans, so test truthiness rather than identity with False
    ind_fail = {k: time_in[k] for k, v in enumerate(result) if not v}
    time_ascending = 'fail: {}'.format(ind_fail)
print(time_ascending)

pass
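
The ascending-order test above can also be written without an explicit loop. This is a minimal sketch, assuming a made-up array `sample_times` in which one value goes backwards in time. The names `steps` and `bad_positions` are illustrative only.

```python
import numpy as np

# Hypothetical timestamps; the third value goes backwards in time.
sample_times = np.array(
    ['2016-10-08T08:00:01', '2016-10-08T12:00:01',
     '2016-10-08T10:00:01', '2016-10-08T16:00:01'],
    dtype='datetime64[s]')

# np.diff returns the step between each consecutive pair of timestamps.
steps = np.diff(sample_times)

# Positions k where sample_times[k + 1] is not later than sample_times[k].
bad_positions = np.flatnonzero(steps <= np.timedelta64(0, 's'))

ascending_test = 'pass' if bad_positions.size == 0 else 'fail'
print(ascending_test, bad_positions.tolist())
```

Because the comparison is `<= 0`, a repeated timestamp also counts as a failure, which matches the strict `> 0` test used in the cell above.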


<a id="8"></a>
### Sampling Rate Consistency.
1. Calculate the sampling rate to the nearest second.

    a. Calculate the differences in timestamps.    
    b. Group sampling rates.   

2. Count the number of unique sampling rates.
3. Extract the common sampling rate and the percentage of its occurrence in the dataset.

In [19]:
# (1) 
# Calculate the sampling rate to the nearest second.
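# (a) Sampling rates are calculated from the differences in timestamps. 

time_df = pd.DataFrame(time.values, columns=['time'])
# .dt.total_seconds() gives float seconds; astype('timedelta64[s]') keeps a timedelta dtype in pandas 2.x
time_df['diff'] = time_df['time'].diff().dt.total_seconds().round()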
time_df

| | time | diff |
|---|---|---|
| 0 | 2016-10-08 08:00:01 | NaN |
| 1 | 2016-10-08 12:00:01 | 14400.0 |
| 2 | 2016-10-19 00:00:01 | 907200.0 |
| 3 | 2016-10-19 04:00:01 | 14400.0 |
| 4 | 2016-10-19 08:00:01 | 14400.0 |
| ... | ... | ... |
| 103 | 2016-12-16 20:00:01 | 14400.0 |
| 104 | 2016-12-18 04:00:01 | 115200.0 |
| 105 | 2016-12-18 08:00:01 | 14400.0 |
| 106 | 2016-12-18 12:00:01 | 14400.0 |


In [20]:
# (b) Group Sampling Rates.
rates_df = time_df.groupby(['diff']).agg(['count'])
rates_df

| diff | time: count |
|---|---|
| 14400.0 | 73 |
| 28800.0 | 5 |
| 43200.0 | 4 |
| 57600.0 | 4 |
| 72000.0 | 2 |
| 86400.0 | 2 |
| 100800.0 | 1 |
| 115200.0 | 2 |
| 129600.0 | 3 |
| 144000.0 | 1 |


In [21]:
# (2)
# Count the number of unique sampling rates.
# Get the most common sampling rate in the next cell
rates = dict(n_unique_rates=len(rates_df), common_sampling_rates=dict())
rates

{'n_unique_rates': 18, 'common_sampling_rates': {}}

In [61]:
# (3)
# Extract each sampling rate that accounts for more than 10% of the
# timestamp differences, with the percentage of its occurrence in the dataset.
for i, row in rates_df.iterrows():
    percent = (float(row['time']['count']) / float(len(time_df) - 1))
    if percent > 0.1:
        rates['common_sampling_rates'].update({int(i): '{:.2%}'.format(percent)})
rates

{'n_unique_rates': 18, 'common_sampling_rates': {14400: '68.22%'}}
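
The grouping and percentage steps above can be condensed with `value_counts(normalize=True)`, which returns the share of each distinct interval sorted from most to least frequent. This is a minimal sketch on made-up timestamps. The names `sample_times`, `intervals` and `shares` are assumptions introduced here.

```python
import pandas as pd

# Hypothetical timestamps: three 4-hour steps and one 8-hour step.
sample_times = pd.Series(pd.to_datetime([
    '2016-10-08 08:00:01', '2016-10-08 12:00:01', '2016-10-08 16:00:01',
    '2016-10-08 20:00:01', '2016-10-09 04:00:01']))

# Interval between consecutive timestamps, in whole seconds.
intervals = sample_times.diff().dropna().dt.total_seconds().round().astype(int)

# Fraction of intervals taken up by each distinct sampling rate,
# sorted from most to least frequent.
shares = intervals.value_counts(normalize=True)

most_common_rate = int(shares.index[0])
most_common_share = float(shares.iloc[0])
print(most_common_rate, '{:.2%}'.format(most_common_share))
```

The first entry of `shares` can then be compared against whichever threshold the test uses.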

<a id="5"></a>
### Start/End Time Gap.
<a id="51"></a>
#### Check the Start/End Time.
1. File Days: Number of days for which there is at least 1 timestamp.

2. Deployment Days: Number of days of a deployment.

    a. Load Deployment File.    
    b. Extract Start and End Time.   
    c. Calculate the number of days of a deployment.
 
 
3. Calculate the gap size at the start of the deployment.
4. Calculate the gap size at the end of the deployment.

In [25]:
# (1) 
# File Days:
# Number of days for which there is at least 1 timestamp available for the instrument.
n_days = len(np.unique(time.values.astype('datetime64[D]')))

print('File Days: ', n_days)

File Days:  40


In [28]:
# (2)
# a. Load Deployment file
%cd '/Users/leilabelabassi/Desktop/TAMU/online-class/612-DataQuality4theGeosciences/class_material/Module3_cruise_info_GP03FLMB-RIM01-02-CTDMOG060/'
%ls

/Users/leilabelabassi/Desktop/TAMU/online-class/612-DataQuality4theGeosciences/class_material/Module3_cruise_info_GP03FLMB-RIM01-02-CTDMOG060
GP03FLMB-RIM01-02-CTDMOG060_info.csv   GP03FLMB-RIM01-02-CTDMOG060_info.xlsx


In [29]:
deployment_file = pd.read_csv('GP03FLMB-RIM01-02-CTDMOG060_info.csv')

# Extract the deployment number from the data file.
deployment_num = np.unique(file_content['deployment'])[0]
deployment_x = deployment_file[deployment_file['Deployment'] == deployment_num]

In [64]:
# b. Extract Start and End Time.
deploy_start = deployment_x['Start Date'].values[0] 
deploy_stop = deployment_x['Stop Date'].values[0]  

# Convert Time
r_deploy_start = pd.to_datetime(deploy_start)
r_deploy_stop = pd.to_datetime(deploy_stop)

print(deploy_start,r_deploy_start )
print(deploy_stop, r_deploy_stop)

2016-07-04 2016-07-04 00:00:00
2017-07-17 2017-07-17 00:00:00


In [31]:
# c. Number of days the instrument was deployed.
n_days_deployed = (r_deploy_stop - r_deploy_start).days

print('Deployment Days:', n_days_deployed)

Deployment Days: 378


In [32]:
# (3)
# Calculate the gap size in days at the start of the deployment
start_gap = (pd.to_datetime(str(time.values[0])) - r_deploy_start).days
print('Start Gap: ', start_gap)


# (4)
# Calculate the gap size in days at the end of the deployment
end_gap = (r_deploy_stop - pd.to_datetime(str(time.values[-1]))).days

print('End Gap: ', end_gap)

Start Gap:  96
End Gap:  209
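
The `.days` attribute used above keeps only the whole-day part of a time difference, so a partial day is dropped rather than rounded. The sketch below illustrates this with the deployment start date and first timestamp shown in the earlier cells. The names `difference`, `whole_days` and `fractional_days` are introduced here for illustration.

```python
import pandas as pd

# Values copied from the cells above.
deploy_start = pd.to_datetime('2016-07-04')
first_timestamp = pd.to_datetime('2016-10-08 08:00:01')

difference = first_timestamp - deploy_start

# .days keeps only the whole-day part of the difference.
whole_days = difference.days
# total_seconds() / 86400 keeps the fractional part as well.
fractional_days = difference.total_seconds() / 86400

print(difference)
print(whole_days, round(fractional_days, 2))
```

If fractional days matter for the review, `total_seconds()` gives the finer value; the whole-day count matches the Start Gap reported above.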


<a id="6"></a>
### Gap Identification.
<a id="61"></a>
#### Identify Gaps in the Time Array
1. Gap Days: Number of days of missing data within a data file. 
2. Identify gaps with >1 day of missing data.
3. Extract the start and end dates of >1 day gaps.

In [33]:
# (1)
# Gap Days: compute the time difference between consecutive timestamps
# to find missing data within the data file
time_df['diff'] = time_df['time'].diff()
time_df

| | time | diff |
|---|---|---|
| 0 | 2016-10-08 08:00:01 | NaT |
| 1 | 2016-10-08 12:00:01 | 0 days 04:00:00 |
| 2 | 2016-10-19 00:00:01 | 10 days 12:00:00 |
| 3 | 2016-10-19 04:00:01 | 0 days 04:00:00 |
| 4 | 2016-10-19 08:00:01 | 0 days 04:00:00 |
| ... | ... | ... |
| 103 | 2016-12-16 20:00:01 | 0 days 04:00:00 |
| 104 | 2016-12-18 04:00:01 | 1 days 08:00:00 |
| 105 | 2016-12-18 08:00:01 | 0 days 04:00:00 |
| 106 | 2016-12-18 12:00:01 | 0 days 04:00:00 |


In [34]:
# (2)
# Identify gaps with >1 day of missing data.
index_gap = time_df['diff'][time_df['diff'] > pd.Timedelta(days=1)].index.tolist()
print('Gap durations: ','\n', (str(time_df['diff'][index_gap])))
print('Gap Count: ', len(index_gap))

Gap durations:  
 2     10 days 12:00:00
11     4 days 20:00:00
18     2 days 04:00:00
19     1 days 20:00:00
20     1 days 16:00:00
21     3 days 16:00:00
27     5 days 20:00:00
32     2 days 20:00:00
33     1 days 12:00:00
38     2 days 04:00:00
40     1 days 12:00:00
47     3 days 12:00:00
74     1 days 08:00:00
80     1 days 04:00:00
96     2 days 04:00:00
100    1 days 12:00:00
104    1 days 08:00:00
Name: diff, dtype: timedelta64[ns]
Gap Count:  17


In [35]:
# (3)
# Extract the start and end dates of >1 day gaps.
gap_list = []
for i in index_gap:
    gap_list.append([time_df['time'][i-1].strftime('%Y-%m-%dT%H:%M:%S'),
                     time_df['time'][i].strftime('%Y-%m-%dT%H:%M:%S')])

print('Gap List: ','\n', gap_list)

Gap List:  
 [['2016-10-08T12:00:01', '2016-10-19T00:00:01'], ['2016-10-20T16:00:01', '2016-10-25T12:00:01'], ['2016-10-26T12:00:01', '2016-10-28T16:00:01'], ['2016-10-28T16:00:01', '2016-10-30T12:00:01'], ['2016-10-30T12:00:01', '2016-11-01T04:00:01'], ['2016-11-01T04:00:01', '2016-11-04T20:00:01'], ['2016-11-07T00:00:01', '2016-11-12T20:00:01'], ['2016-11-14T16:00:01', '2016-11-17T12:00:01'], ['2016-11-17T12:00:01', '2016-11-19T00:00:01'], ['2016-11-21T08:00:01', '2016-11-23T12:00:01'], ['2016-11-24T04:00:01', '2016-11-25T16:00:01'], ['2016-11-28T00:00:01', '2016-12-01T12:00:01'], ['2016-12-05T20:00:01', '2016-12-07T04:00:01'], ['2016-12-08T00:00:01', '2016-12-09T04:00:01'], ['2016-12-12T00:00:01', '2016-12-14T04:00:01'], ['2016-12-14T20:00:01', '2016-12-16T08:00:01'], ['2016-12-16T20:00:01', '2016-12-18T04:00:01']]
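
The cells above count the gaps and list their start and end dates, but do not total the missing days. The following is a minimal sketch of one way to do so on made-up timestamps. It assumes that Gap Days means the summed duration of all gaps longer than one day, truncated to whole days; that definition is an interpretation made here. The names `sample_times`, `steps`, `gaps` and `gap_days` are illustrative.

```python
import pandas as pd

# Hypothetical timestamps with two gaps longer than one day.
sample_times = pd.Series(pd.to_datetime([
    '2016-10-08 08:00:01', '2016-10-08 12:00:01',
    '2016-10-19 00:00:01', '2016-10-19 04:00:01',
    '2016-10-21 08:00:01']))

steps = sample_times.diff()

# Keep only the steps that exceed the one-day gap threshold.
gaps = steps[steps > pd.Timedelta(days=1)]

gap_count = len(gaps)
# Assumption: Gap Days is the summed gap duration, truncated to whole days.
gap_days = gaps.sum().days
print(gap_count, gap_days)
```

In the notebook, the same sum could be taken over `time_df['diff'][index_gap]`.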


<a id="7"></a>
## Time Coverage Evaluation Summary.

In [39]:
# Put the time coverage evaluation results in a dataframe.
# (DataFrame.append was removed in pandas 2.0, so the frame is built directly.)
df = pd.DataFrame({
                    'Deployment Days':[n_days_deployed],
                    'File Days': [n_days],
                    'Start Gap': [start_gap],
                    'End Gap': [end_gap],                    
                    'Data Coverage (%)': [round((n_days*100)/n_days_deployed)],
                    
                    'Timestamps': [len_time],
                    'Ascending Test':[time_ascending],
                    'Unique Test':[time_unique],   
                    'Sampling Rate (s)': [rates],

                    # len(index_gap) is the number of gaps, not the number of missing days
                    'Gap Count': [len(index_gap)],
                    'Gap List': [gap_list]
                    }, index=['Results'])

pd.set_option('display.max_colwidth', None)
df.T

| | Results |
|---|---|
| Deployment Days | 378 |
| File Days | 40 |
| Start Gap | 96 |
| End Gap | 209 |
| Data Coverage (%) | 11 |
| Timestamps | 108 |
| Ascending Test | pass |
| Unique Test | pass |
| Sampling Rate (s) | {'n_unique_rates': 18, 'common_sampling_rates': {14400: '68.22%'}} |
| Gap Count | 17 |


### Time Coverage Review

- The time array appears to be of good quality.
- There are no duplicate timestamps, the timestamps are in ascending order, and the sampling rate is stable (14400 s in 68.22% of the intervals).
- There are large gaps at the beginning (96 days) and the end (209 days) of the deployment.
- The time coverage is very low (about 11% of the deployment days), which requires an annotation by the system to let users know what happened with the sensor recording the data.
- There are 17 gaps in the available data, ranging from just over 1 day to about 10 days, which also require an explanation.


## END