# Valid Data Tests For Science Variables
*Written by Leila Belabbassi and Lori Garzio, Rutgers University*

Purpose: demonstrate the automated tools used to evaluate data values for science variables from glider data streams. We will use the PAR sensor on Pioneer Glider 335 (CP05MOAS-GL335-05-PARADM000) for this example.

In [1]:
# functions and packages needed to run the notebook
import xarray as xr
import numpy as np
import pandas as pd
import functions.common as cf
from IPython.display import Image
from IPython.core.display import HTML

**Step 1: Get Dataset Review List**  
- Get the list of data files for review from the local file created using the 2.0_data_review_list.ipynb notebook

In [4]:
reviewlist = pd.read_csv('data_review_list_CP05MOAS-GL335-05-PARADM000.csv')
pd.set_option('display.max_colwidth', None)  # None disables truncation so full URLs are shown (-1 is deprecated)
reviewlist

Unnamed: 0.1,Unnamed: 0,datasets,method
0,deployment0001,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0001_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20141006T202152.905850-20141213T073238.247380.nc,recovered_host
1,deployment0002,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0002_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20151014T001900.237980-20151110T091855.472810.nc,recovered_host
2,deployment0003,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0003_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20160404T185705.311220-20160417T235956.145260.nc,recovered_host
3,deployment0004,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0004_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20160529T204727.075500-20160626T091401.747920.nc,recovered_host
4,deployment0005,https://opendap.oceanobservatories.org/thredds/dodsC/ooi/lgarzio@marine.rutgers.edu/20190509T131304-CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered/deployment0005_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_glider_recovered_20170116T150223.595370-20170304T093047.153350.nc,recovered_host


**Step 2: Define Valid Data**
- Valid data are data points that are not:
    - NaNs
    - fill values
    - outside global ranges (defined by the OOI system)
    - outside 5 standard deviations of the mean
        
- The output groups each science variable's percentage of valid data into the following bins:

| Bin | Valid data (%)  |
|-----|:---------------:|
| 99  | > 99            |
| 95  | 95 - 99         |
| 75  | 75 - 95         |
| 50  | 50 - 75         |
| 25  | 25 - 50         |
| 0   | 0 - 25          |


- Example output of the valid data test for one data file:
    - {'99':4, '95':1} means the data file includes 5 science variables, of which 4 have > 99% valid data points and 1 has 95-99% valid data points.
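
The binning described above can be sketched as follows. This is a minimal illustration, not the code inside `cf.validate_sci_var_report`; the function names `bin_percent_valid` and `summarize_bins` are hypothetical, and the handling of values that fall exactly on a bin edge (lower bound inclusive, except the strict `> 99` bin) is an assumption, since the table does not specify it.

```python
from collections import Counter


def bin_percent_valid(pct):
    """Return the bin label for one variable's percentage of valid data.

    Assumption: lower bounds are inclusive, except the top bin, which
    requires strictly more than 99% valid data.
    """
    if pct > 99:
        return '99'
    for lower in (95, 75, 50, 25):
        if pct >= lower:
            return str(lower)
    return '0'


def summarize_bins(percents):
    """Count how many science variables fall into each bin."""
    return dict(Counter(bin_percent_valid(p) for p in percents))


# percent_valid_data values reported for deployment0001 in this notebook
summary = summarize_bins([0.85, 99.97, 38.4])
print(summary)
```

With the three deployment0001 percentages from this notebook, the sketch yields one variable each in bins 99, 25, and 0.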
    
Run the function to evaluate the science data parameters for each deployment of CP05MOAS-GL335-05-PARADM000:

In [5]:
df = pd.DataFrame()
dr = pd.DataFrame()
valid_list = list(np.zeros(len(reviewlist)))
for index, row in reviewlist.iterrows():
    rd = row['datasets'].split('/')[-1].split('_')[1][0:27]
    stream = row['datasets'].split('/')[-1].split('-')[5].split('_2')[0]
    sci_vars = cf.return_science_vars(stream)
    ds = xr.open_dataset(row['datasets'], mask_and_scale=False)
    ds = ds.swap_dims({'obs': 'time'})

    # calculate statistics for science variables, excluding outliers +/- 5 SD    
    index_i = row['Unnamed: 0']
    valid_list_index = []
    for sv in sci_vars:
        valid_sci_dic = cf.validate_sci_var_report(rd, sv, ds, index_i, valid_list_index)       
        df = pd.concat([df, valid_sci_dic])  # DataFrame.append was removed in pandas 2.0; assumes a DataFrame is returned
    valid_list[index] = df['pvd_test'].values[-1]
    dr0 = pd.DataFrame({'valid_list': [valid_list[index]]}, index=[index_i])
    dr = pd.concat([dr, dr0])  # DataFrame.append was removed in pandas 2.0
dr

Unnamed: 0,valid_list
deployment0001,"{'99': 1, '25': 1, '0': 1}"
deployment0002,"{'99': 1, '25': 1, '0': 1}"
deployment0003,{'99': 3}
deployment0004,"{'99': 2, '0': 1}"
deployment0005,"{'99': 1, '0': 2}"


**Step 3: Expand on the Valid Data Test (percent_valid_data)** 
- calculate the number of data points that fall into the following categories:
    - NaNs (n_nan)
    - fill values (n_fv)
    - outside global ranges (n_gr)
    - outside 5 standard deviations (num_outliers)
        
- calculate basic statistics on the valid data sets:
    - minimum (vmin)
    - maximum (vmax)
    - average (mean)
    - standard deviation (sd)
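
As a rough illustration of the counts and statistics listed above, the sketch below filters a single array step by step. It is written under stated assumptions and is not the implementation in `cf.validate_sci_var_report`: the function name `valid_data_stats` is hypothetical, the categories are applied sequentially (NaNs, then fill values, then global range, then outliers), and outliers are measured against the mean and standard deviation of the in-range data. The notebook's own function may count some categories differently.

```python
import numpy as np


def valid_data_stats(values, fill_value, global_range, n_sd=5):
    """Count excluded points per category and summarize the remaining valid data."""
    values = np.asarray(values, dtype=float)
    gr_min, gr_max = global_range

    # Each category is counted only among points that passed the previous checks
    is_nan = np.isnan(values)
    is_fv = ~is_nan & (values == fill_value)
    candidates = ~is_nan & ~is_fv
    out_of_range = candidates & ((values < gr_min) | (values > gr_max))
    in_range = values[candidates & ~out_of_range]

    # Outliers: more than n_sd standard deviations from the mean of in-range data
    if in_range.size:
        is_outlier = np.abs(in_range - in_range.mean()) > n_sd * in_range.std()
    else:
        is_outlier = np.zeros(0, dtype=bool)
    valid = in_range[~is_outlier]

    has_valid = valid.size > 0
    return {
        'n_nan': int(is_nan.sum()),
        'n_fv': int(is_fv.sum()),
        'n_gr': int(out_of_range.sum()),
        'num_outliers': int(is_outlier.sum()),
        'n_stats': int(valid.size),
        'vmin': float(valid.min()) if has_valid else float('nan'),
        'vmax': float(valid.max()) if has_valid else float('nan'),
        'mean': float(valid.mean()) if has_valid else float('nan'),
        'sd': float(valid.std()) if has_valid else float('nan'),
        'percent_valid_data': round(100 * valid.size / values.size, 2) if values.size else 0.0,
    }


# Made-up values for illustration: one NaN, one fill value, one point above the global range
stats = valid_data_stats(
    [1.0, 2.0, 3.0, np.nan, -9999.0, 5000.0],
    fill_value=-9999.0,
    global_range=(0.0, 2500.0),
)
print(stats)
```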

In [5]:
pd.set_option('display.max_colwidth', None)  # None disables truncation (-1 is deprecated)
df.drop(columns=['fv:fill_value','pvd_test', 'dlst'])

Unnamed: 0,sv,var_units,n_fv,gr: global_range,n_gr,n_nan,num_outliers,n_stats,mean,vmin,vmax,sd,percent_valid_data
deployment0001,sci_bsipar_par,µmol photons m-2 s-1,0,"[0.0, 2500.0]",3699171,0,3699171.0,31755,792.534,0.0,1784.76,886.777,0.85
deployment0001,sci_bsipar_temp,ºC,0,"[-2.0, 40.0]",0,0,1022.0,3729904,16.6991,11.5,23.94,2.2236,99.97
deployment0001,parad_m_par,µmol photons m-2 s-1,0,"[0.0, 2500.0]",2288817,0,2298262.0,1432664,13.2937,0.0,363.427,36.8723,38.4
deployment0002,sci_bsipar_par,µmol photons m-2 s-1,0,"[0.0, 2500.0]",1088649,0,1088649.0,27644,691.979,0.0,1784.76,869.587,2.48
deployment0002,sci_bsipar_temp,ºC,0,"[-2.0, 40.0]",0,0,1112.0,1115181,17.506,12.74,23.53,1.6865,99.9
deployment0002,parad_m_par,µmol photons m-2 s-1,0,"[0.0, 2500.0]",644249,0,647811.0,468482,18.5074,0.0,574.032,56.1156,41.97
deployment0003,sci_bsipar_par,µmol photons m-2 s-1,0,"[0.0, 2500.0]",77,0,3534.0,479508,9.2554,0.0,370.293,34.8497,99.27
deployment0003,sci_bsipar_temp,ºC,0,"[-2.0, 40.0]",0,0,486.0,482556,12.455,7.76,15.65,2.159,99.9
deployment0003,parad_m_par,µmol photons m-2 s-1,0,"[0.0, 2500.0]",0,0,2732.0,480310,0.0,0.0,0.0004,0.0,99.43
deployment0004,sci_bsipar_par,µmol photons m-2 s-1,0,"[0.0, 2500.0]",1535163,0,1535163.0,1125,0.0,0.0,0.0,0.0,0.07


**Step 4: Human In The Loop Observations**

There are two variables that are labeled Photosynthetically Active Radiation: sci_bsipar_par and parad_m_par. sci_bsipar_par is the unscaled version and parad_m_par is the scaled (final) version. sci_bsipar_par can be (but is not always) orders of magnitude higher than parad_m_par.

The global ranges in the system are the same for both of these variables [0.0, 2500.0], even though sci_bsipar_par can legitimately be much higher. Because its values often exceed the stated global range, sci_bsipar_par often has a very low percentage of valid data in the table above. A recommendation in the [RU Data Review Portal](https://datareview.marine.rutgers.edu/instruments/report/CP05MOAS-GL335-05-PARADM000) for PAR is to review these global ranges.

**Science Parameter: Photosynthetically Active Radiation (parad_m_par)**

- observation (1)
    - Deployment 1 data ranges are reasonable. [Image 1]
    - Deployment 3 data are orders of magnitude too low (max value = 0.0004). [Image 2]
    - Note: for deployment 3, sci_bsipar_par values are reasonable, so the scaling factor should be reviewed.

- observation (2)
    - Deployment 1: less than 40% of the data are valid. [Image 3]
    - Note: most of the data are less than 0, which is below the minimum of the global range.
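
One way a reviewer might shortlist deployments for this kind of inspection is to filter the summary table on `percent_valid_data`. The sketch below is illustrative only: the rows are the parad_m_par values copied from the Step 3 table, and the 50% threshold is a hypothetical choice, not one defined by the notebook.

```python
import pandas as pd

# parad_m_par rows copied from the Step 3 summary table
summary = pd.DataFrame({
    'deployment': ['deployment0001', 'deployment0002', 'deployment0003'],
    'sv': ['parad_m_par'] * 3,
    'vmax': [363.427, 574.032, 0.0004],
    'percent_valid_data': [38.4, 41.97, 99.43],
})

threshold = 50  # hypothetical cutoff (percent valid data) for manual review

# Keep only deployments where less than `threshold` percent of the data are valid
low_valid = summary[summary['percent_valid_data'] < threshold]
flagged = low_valid['deployment'].tolist()
print(flagged)
```

A filter like this only catches the low-validity case in observation (2). The deployment 3 issue in observation (1) passes the valid data test and shows up only in the statistics (vmax) and plots.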

**Image 1**

In [6]:

Image(url= "https://marine.rutgers.edu/cool/ooi/data-eval/data_review/CP/CP05MOAS/CP05MOAS-GL335/CP05MOAS-GL335-05-PARADM000/profile_plots/deployment0001/all_data_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_par.png")

**Image 2**

In [7]:
Image(url= "https://marine.rutgers.edu/cool/ooi/data-eval/data_review/CP/CP05MOAS/CP05MOAS-GL335/CP05MOAS-GL335-05-PARADM000/profile_plots/deployment0003/all_data_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_par.png")

**Image 3**

In [8]:
Image(url= "https://marine.rutgers.edu/cool/ooi/data-eval/data_review/CP/CP05MOAS/CP05MOAS-GL335/CP05MOAS-GL335-05-PARADM000/profile_plots/deployment0001/all_data_CP05MOAS-GL335-05-PARADM000-recovered_host-parad_m_par-zoom.png")

**END**

- Visit the [instrument report page](https://datareview.marine.rutgers.edu/instruments/report/CP05MOAS-GL335-05-PARADM000) for more information.