## Data Validation Process


1. This notebook demonstrates the process used to evaluate multiple parameters from the data files produced by a sequence of deployments of a moored instrument located in the North Pacific Ocean ([see this link for more information about the station](https://drive.google.com/open?id=1_kAzdVfou_PWG-UnwB7Nib4NpyktKsVg)).


2. The selected parameters of interest are evaluated using:

<blockquote> Data Validation Test: Counts the data points that fall into each of the categories below, and creates an array of indices for the valid data points. 

- NaN
- Fill values
- Extreme values
- Values outside the global ranges
- Outliers (values beyond 3 standard deviations from the mean) </blockquote>

<blockquote> Basic Statistics Method: Calculates basic statistics on the valid datasets:
    
- Minimum 
- Maximum 
- Average 
- Standard deviation
- Percent good data</blockquote>


### Notebook Outline:

- [Python Packages.](#1)
- [Load Data Files Review List.](#2)
- [Create Functions To Run Data Files.](#3)
    - [Set Parameters List Function.](#31)
    - [Set Global Range Function.](#32)
        - [Load Global Ranges File.](#321)
        - [Example.](#322)
    - [Set Valid Data Functions.](#33)
    - [Set Basic Statistics Function.](#34)
    - [Set Percent Valid Data Function.](#35)
- [Write a Data Validation Processing Method.](#4)
- [Summary of Findings.](#5)

<span style='color:Orange' size=20 > **Attention:** </span> 
- To run the notebook, you need to follow the steps in order.
- For each code cell, run the cell before you move on to the next one. 
    - **Remember**: The output of a cell may be an input in the next cell.

<a id=1 ></a>
### Python Packages.

In [1]:
# functions and packages needed to run the notebook
import xarray as xr
import numpy as np
import pandas as pd
import requests

<a id=2 ></a>
### Load Data Files Review List.
Get the list of data files for review from the local file created for this example.


In [2]:
%cd '/Users/leilabelabassi/Desktop/TAMU/online-class/612-DataQuality4theGeosciences/class_material/Module4_csvFiles'
reviewlist = pd.read_csv('data_review_list_GP03FLMB-RIM01-02-CTDMOG060_recovered.csv')
pd.set_option('display.max_colwidth', None)
(reviewlist)

/Users/leilabelabassi/Desktop/TAMU/online-class/612-DataQuality4theGeosciences/class_material/Module4_csvFiles


| | files |
|---|---|
| 0 | deployment0001_GP03FLMB-RIM01-02-CTDMOG060-recovered_inst-ctdmo_ghqr_instrument_recovered_20130724T064501-20140617T234501.nc |
| 1 | deployment0003_GP03FLMB-RIM01-02-CTDMOG060-recovered_inst-ctdmo_ghqr_instrument_recovered_20150608T213001-20160703T183001.nc |
| 2 | deployment0004_GP03FLMB-RIM01-02-CTDMOG060-recovered_inst-ctdmo_ghqr_instrument_recovered_20160704T231501-20170717T150001.nc |
| 3 | deployment0005_GP03FLMB-RIM01-02-CTDMOG060-recovered_inst-ctdmo_ghqr_instrument_recovered_20170714T230001-20180725T170001.nc |
| 4 | deployment0006_GP03FLMB-RIM01-02-CTDMOG060-recovered_inst-ctdmo_ghqr_instrument_recovered_20180724T231501-20190927T234501.nc |


<a id=3 ></a>
### Create Functions To Run Data Files.

<a id=31 ></a>
#### Set Parameters List Function.
- Use **return_science_vars** function to retrieve the parameters for the validation process.
- The **requests** package fetches the stream's JSON file from its URL in the system database.
- The **.json()** method parses the response, and the function then drills down to each parameter's data product type to select the parameters tagged 'Science Data' ([see a visual of the file hierarchy here](https://drive.google.com/open?id=1WHarAjyGJObgl2V-XyYC8P9-eJtSHSge)).

In [3]:
def return_science_vars(stream):
    """
    Return only the science data parameters.
    Example URL:
    http://datareview.marine.rutgers.edu/streams/view/ctdmo_ghqr_instrument_recovered.json
    """
    sci_vars = []
    dr = 'http://datareview.marine.rutgers.edu/streams/view/{}.json'.format(stream)
    r = requests.get(dr)
    params = r.json()['stream']['parameters']
    for p in params:
        if p['data_product_type'] == 'Science Data':
            sci_vars.append(p['name'])
    return sci_vars
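
The drill-down step can be tried without a network connection. The sketch below is an illustration only: `filter_science_vars` and `sample_payload` are hypothetical names, and the payload values (including the 'Auxiliary Data' label) are invented. Only the key layout (`stream` → `parameters` → `name` / `data_product_type`) is taken from the function above.

```python
def filter_science_vars(payload):
    """Return the names of parameters tagged 'Science Data' in a parsed stream payload."""
    params = payload["stream"]["parameters"]
    return [p["name"] for p in params if p.get("data_product_type") == "Science Data"]


# Hypothetical payload: same key layout as the JSON used above, values invented.
sample_payload = {
    "stream": {
        "parameters": [
            {"name": "time", "data_product_type": None},
            {"name": "density", "data_product_type": "Science Data"},
            {"name": "internal_timestamp", "data_product_type": "Auxiliary Data"},
            {"name": "practical_salinity", "data_product_type": "Science Data"},
        ]
    }
}

science_names = filter_science_vars(sample_payload)
print(science_names)
```

Separating the filtering from the HTTP request in this way makes the selection logic easy to check on its own.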

<a id=311></a>
##### Example

In [4]:
# How to use return_science_vars function:
return_science_vars('ctdmo_ghqr_instrument_recovered')

['density',
 'practical_salinity',
 'ctdmo_seawater_pressure',
 'ctdmo_seawater_temperature',
 'ctdmo_seawater_conductivity']

<a id=32 ></a>
#### Set Global Range Function.
For a given location and parameter:
- return the minimum and maximum values of the global range.

In [5]:
def get_global_ranges(GR_file, ReferenceDesignator, ParameterID):
    
    row = GR_file[(GR_file['ReferenceDesignator'] == ReferenceDesignator) & (GR_file['ParameterID_R'] == ParameterID)]
    min_va = row['GlobalRangeMin'].values[0]
    max_va = row['GlobalRangeMax'].values[0]
    
    return min_va, max_va


<a id=321 ></a>
##### Load Global Ranges File.
- The file was created for the system to check its science parameters against a list of global ranges provided by Subject Matter Experts.
- The file has been provided for you in Canvas and in the shared Google Drive.

In [6]:
global_ranges_file = pd.read_csv('data_qc_global_range_values.csv')

# use the lines below to view the first few rows of the file
pd.set_option('display.max_colwidth', None)
(global_ranges_file.head())

| | ReferenceDesignator | ParameterID_R | ParameterID_T | GlobalRangeMin | GlobalRangeMax | _DataLevel | _Units | _Array ID | _Platform ID | _Instrument |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CE01ISSM-MFD35-01-VEL3DD000 | vel3d_c_eastward_turbulent_velocity | vel3d_c_eastward_turbulent_velocity | -3.0 | 3.0 | L1 | m s-1 | CE01ISSM | MFD35 | VEL3DD000 |
| 1 | CE01ISSM-MFD35-01-VEL3DD000 | vel3d_c_northward_turbulent_velocity | vel3d_c_northward_turbulent_velocity | -3.0 | 3.0 | L1 | m s-1 | CE01ISSM | MFD35 | VEL3DD000 |
| 2 | CE01ISSM-MFD35-01-VEL3DD000 | vel3d_c_upward_turbulent_velocity | vel3d_c_upward_turbulent_velocity | -1.0 | 1.0 | L1 | m s-1 | CE01ISSM | MFD35 | VEL3DD000 |
| 3 | CE01ISSM-MFD35-01-VEL3DD000 | seawater_pressure_mbar | seawater_pressure_mbar | 0.0 | 1100000.0 | | 0.001 dbar | CE01ISSM | MFD35 | VEL3DD000 |
| 4 | CE01ISSM-MFD35-01-VEL3DD000 | turbulent_velocity_east | turbulent_velocity_east | -3000.0 | 3000.0 | L0 | mm s-1 | CE01ISSM | MFD35 | VEL3DD000 |


<a id=322></a>
##### Example:

In [7]:
# How to use get_global_ranges function:
get_global_ranges(global_ranges_file, 'GP03FLMB-RIM01-02-CTDMOG060', 'density')


(1000.0, 1100.0)
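
Note that `.values[0]` raises an `IndexError` when no row matches the requested location and parameter. The sketch below shows one possible way to handle that case. It is an illustration under assumptions: `get_global_ranges_safe` and `toy_ranges` are hypothetical names, and the second reference designator in the toy table is made up. The column names are the ones from the global ranges file.

```python
import pandas as pd

# Toy table with the same column names as the global ranges file;
# the second reference designator is invented for illustration.
toy_ranges = pd.DataFrame({
    "ReferenceDesignator": ["GP03FLMB-RIM01-02-CTDMOG060", "XX00TEST-AAA00-00-DEMO00000"],
    "ParameterID_R": ["density", "density"],
    "GlobalRangeMin": [1000.0, 900.0],
    "GlobalRangeMax": [1100.0, 1200.0],
})


def get_global_ranges_safe(gr_table, refdes, parameter):
    """Return (min, max) for a location and parameter, or (None, None) if no row matches."""
    match = gr_table[(gr_table["ReferenceDesignator"] == refdes)
                     & (gr_table["ParameterID_R"] == parameter)]
    if match.empty:
        return None, None
    first = match.iloc[0]
    return float(first["GlobalRangeMin"]), float(first["GlobalRangeMax"])


found = get_global_ranges_safe(toy_ranges, "GP03FLMB-RIM01-02-CTDMOG060", "density")
missing = get_global_ranges_safe(toy_ranges, "GP03FLMB-RIM01-02-CTDMOG060", "oxygen")
print(found, missing)
```

Returning `(None, None)` is only one option; a caller that uses it would need to skip the global range test when no range is available.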

<a id=33 ></a>
#### Set Valid Data Functions.

For a given dataset:
- return the indices of the data points that are valid. 
- calculate the number of invalid data points.

In [8]:
def reject_nan(data):
    """
    Reject Nans.
    """
    ind = ~np.isnan(data)
    n = len(data) - len(data[ind])
    
    return ind, n


def reject_fill_values(data, FillValue):
    """
    Reject Fill Values.
    """
    ind = (data != FillValue)
    n = len(data) - len(data[ind])
    
    return ind, n


def reject_extreme_values(data, ExtremValue):
    """
    Reject Extreme Data Values.
    """
    ind = (data > -ExtremValue) & (data < ExtremValue)
    n = len(data) - len(data[ind])
    
    return ind, n


def reject_global_ranges(data, gmin, gmax):
    """
    Reject Data Outside Global Ranges.
    """
    ind = (data >= gmin) & (data <= gmax)
    n = len(data) - len(data[ind])
    
    return ind, n


def reject_outliers(data, m=3):
    """
    Reject Outliers Using Standard Deviation.
    """
    stdev = np.nanstd(data)
    if stdev > 0.0:
        ind = abs(data - np.nanmean(data)) < m * stdev
    else:
        # zero spread: every point equals the mean, so keep them all
        ind = np.ones(len(data), dtype=bool)
    n = len(data) - len(data[ind])
    
    return ind, n
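
To see how the tests behave when chained, here is a minimal self-contained sketch on a toy array. All values, the range of 0 to 100, and the helper `keep` are invented for illustration; the masks mirror the ones used in the functions above.

```python
import numpy as np

# Toy series; every value here is made up for illustration.
fill_value = -9999999.0
raw = np.array([10.0] * 10 + [10.2] * 10 + [np.nan, fill_value, 5e8, 150.0, 30.0])


def keep(data, mask):
    """Apply a boolean mask and report how many points it removed."""
    return data[mask], int(data.size - mask.sum())


data, n_nan = keep(raw, ~np.isnan(raw))                    # NaN test
data, n_fv = keep(data, data != fill_value)                # fill value test
data, n_ev = keep(data, np.abs(data) < 1e7)                # extreme value test
data, n_gr = keep(data, (data >= 0.0) & (data <= 100.0))   # global range test
spread = np.std(data)
data, n_ol = keep(data, np.abs(data - np.mean(data)) < 3 * spread)  # outlier test

print(n_nan, n_fv, n_ev, n_gr, n_ol, data.size)
```

Each test removes exactly one point here, leaving 20 of the 25 values. The order matters: if the value 5e8 were still present when the standard deviation is computed, the spread would be so large that 30.0 would not be flagged as an outlier.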

<a id=34 ></a>
#### Set Basic Statistics Function.
For a given dataset, return the mean, the minimum, the maximum, and the standard deviation.


In [9]:
def variable_statistics(data):
    """
    Calculate statistics
    """
    data = data.astype('float64')  # force variables to be float64 (float32 is not JSON serializable)

    d_mean = round(np.nanmean(data), 4)
    d_min = round(np.nanmin(data), 4)
    d_max = round(np.nanmax(data), 4)
    d_sd = round(np.nanstd(data), 4)

    return d_mean, d_min, d_max, d_sd
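
The `float64` cast matters if the statistics are later written out as JSON. The short sketch below illustrates the difference using Python's standard `json` module; the sample values are made up.

```python
import json
import numpy as np

sample = np.array([1.5, 2.5, np.nan], dtype="float32")  # made-up values

mean32 = np.nanmean(sample)                    # numpy.float32 scalar
mean64 = np.nanmean(sample.astype("float64"))  # numpy.float64 scalar

try:
    json.dumps({"mean": mean32})
    float32_ok = True
except TypeError:
    # numpy.float32 is not a subclass of Python's float, so json rejects it
    float32_ok = False

# numpy.float64 subclasses Python's float, so it serializes directly
encoded = json.dumps({"mean": round(mean64, 4)})
print(float32_ok, encoded)
```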

<a id=35 ></a>
#### Set Percent Valid Data Function.

In [10]:
def percent_valid_data(n_all, n_stats):
    """
    Calculate the percent of valid data after rejecting erroneous values
    """
    return (n_stats/n_all) * 100
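
The function above divides by `n_all`, so an empty variable would raise a `ZeroDivisionError`. A possible guard is sketched below; `percent_valid_data_safe` is a hypothetical name, and returning NaN for an empty array is a design assumption rather than part of the original method.

```python
import math


def percent_valid_data_safe(n_all, n_valid):
    """Percent of points kept; NaN when the input array was empty."""
    if n_all == 0:
        return math.nan
    return 100.0 * n_valid / n_all


kept_20_of_25 = percent_valid_data_safe(25, 20)
empty_case = percent_valid_data_safe(0, 0)
print(kept_20_of_25, empty_case)
```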

<a id=4 ></a>
### Write a Data Validation Processing Method.

- Read datasets.
- Use the function return_science_vars to retrieve the science parameters for the validation process.
- Define variables and their attributes.
- Use the Valid Data Functions (reject_* ) to count the number of data points with invalid values.
- Use variable_statistics function to calculate statistics.
- Use percent_valid_data function to calculate percent of good data.
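
Before reading each dataset, the method extracts the deployment number, reference designator, and stream name from the file name using string splits. The sketch below shows an alternative based on a regular expression. It is an illustration under assumptions: the pattern was written against the file names in the review list above, and `FILENAME_PATTERN`, `parse_filename`, and the group name `method` are hypothetical.

```python
import re

# Pattern inferred from the review-list file names; an assumption, not an official spec.
FILENAME_PATTERN = re.compile(
    r"deployment(?P<deployment>\d+)_"
    r"(?P<refdes>[A-Z0-9]+-[A-Z0-9]+-\d+-[A-Z0-9]+)-"
    r"(?P<method>[a-z_]+)-"
    r"(?P<stream>[a-z0-9_]+?)_\d{8}T\d{6}-\d{8}T\d{6}\.nc$"
)


def parse_filename(name):
    """Return (deployment number, reference designator, stream) from a data file name."""
    m = FILENAME_PATTERN.match(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name}")
    return int(m["deployment"]), m["refdes"], m["stream"]


example = ("deployment0001_GP03FLMB-RIM01-02-CTDMOG060-recovered_inst-"
           "ctdmo_ghqr_instrument_recovered_20130724T064501-20140617T234501.nc")
parsed = parse_filename(example)
print(parsed)
```

Unlike taking the last character of `deployment0001`, converting the whole digit group to an integer would still give the right number for a deployment numbered 10 or higher.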

In [11]:
%cd '/Users/leilabelabassi/Desktop/TAMU/online-class/612-DataQuality4theGeosciences/class_material/module4_NetCDF_Files'

# open an empty data frame (tabular array) to save the output of the for loop below:
df = pd.DataFrame()

# define extreme value:
ExtremValue = 1e7

for index, row in reviewlist.iterrows():
    
    # extract the names from the filename:
    deployment = row['files'].split('_')[0][-1]
    refdes = row['files'].split('_')[1][0:27]
    stream = row['files'].split('-')[5].split('_2')[0]
   
    # load data using xarray function (xr.open_dataset())
    ds = xr.open_dataset(row['files'], mask_and_scale=False)

    # return variables:
    sci_vars = return_science_vars(stream)
    
    # for every returned variable:
    for sci_var in sci_vars:
  
        # define variable and its attributes
        data = ds[sci_var].values 
        units = ds[sci_var].units
        FillValue = ds[sci_var]._FillValue
        g_min, g_max = get_global_ranges(global_ranges_file, refdes, sci_var)        
        n_all = len(data)
        
        # report on invalid data:
        ind, n = reject_nan(data)
        n_nan, data = n, data[ind]        
        
        ind, n = reject_fill_values(data, FillValue)
        n_fv, data = n, data[ind]
             
        ind, n = reject_extreme_values(data, ExtremValue)
        n_ev, data = n, data[ind]

        ind, n = reject_global_ranges(data, g_min, g_max)
        n_gr, data = n, data[ind]
        
        ind, n = reject_outliers(data, m=3)
        n_ol, data = n, data[ind]    
        
        # calculate statistics
        [meann, minn, maxx, std] = variable_statistics(data)
        
        # calculate percent of good data
        percent_good = percent_valid_data(n_all, len(data))      
        
        df0 = pd.DataFrame({
                            'sv': [sci_var], 'var_units': [units], 'fv:fill_value': [FillValue],
                            'n_fv': [n_fv],'gr:global_range': [[g_min, g_max]],'n_gr': [n_gr],
                            'n_nan': [n_nan], '+/-1e7:n_ev': [n_ev],
                            'Outliers: n_ol': [n_ol], 
                            'mean': [meann],'min': [minn],'max': [maxx],'STD': [std], 
                            'valid (%)': [percent_good]                            
                           }, index = [deployment]) 
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, df0])

pd.set_option('display.max_colwidth', None)
df

/Users/leilabelabassi/Desktop/TAMU/online-class/612-DataQuality4theGeosciences/class_material/module4_NetCDF_Files


| | sv | var_units | fv:fill_value | n_fv | gr:global_range | n_gr | n_nan | +/-1e7:n_ev | Outliers: n_ol | mean | min | max | STD | valid (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | density | kg m-3 | -9999999.0 | 0 | [1000.0, 1100.0] | 1 | 0 | 0 | 60 | 1025.0222 | 1023.8556 | 1025.7647 | 0.384 | 99.806699 |
| 1 | practical_salinity | 1 | -9999999.0 | 0 | [0.0, 42.0] | 0 | 0 | 0 | 73 | 32.4249 | 31.8524 | 33.0072 | 0.0572 | 99.768673 |
| 1 | ctdmo_seawater_pressure | dbar | -9999.0 | 0 | [0.0, 6000.0] | 0 | 0 | 0 | 29 | 19.7777 | 15.5802 | 24.4729 | 1.46 | 99.908103 |
| 1 | ctdmo_seawater_temperature | ºC | -9999999.0 | 0 | [-2.0, 40.0] | 0 | 0 | 0 | 0 | 9.9679 | 4.8412 | 15.3295 | 2.0768 | 100.0 |
| 1 | ctdmo_seawater_conductivity | S m-1 | -9999999.0 | 0 | [0.0, 6.0] | 0 | 0 | 0 | 1 | 3.5556 | 3.1523 | 4.0365 | 0.183 | 99.996831 |
| 3 | density | kg m-3 | -9999999.0 | 0 | [1000.0, 1100.0] | 16 | 0 | 0 | 64 | 1025.4781 | 1024.276 | 1026.6493 | 0.3509 | 99.786809 |
| 3 | practical_salinity | 1 | -9999999.0 | 0 | [0.0, 42.0] | 0 | 0 | 0 | 18 | 32.5979 | 31.17 | 33.6202 | 0.1049 | 99.952032 |
| 3 | ctdmo_seawater_pressure | dbar | -9999.0 | 0 | [0.0, 6000.0] | 0 | 0 | 0 | 21 | 38.5397 | 33.7772 | 43.0209 | 1.3158 | 99.944037 |
| 3 | ctdmo_seawater_temperature | ºC | -9999999.0 | 0 | [-2.0, 40.0] | 0 | 0 | 0 | 64 | 8.4976 | 5.8196 | 14.0092 | 1.8206 | 99.829447 |
| 3 | ctdmo_seawater_conductivity | S m-1 | -9999999.0 | 0 | [0.0, 6.0] | 0 | 0 | 0 | 25 | 3.4436 | 3.2475 | 3.9572 | 0.1547 | 99.933378 |
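
A common alternative to growing a data frame inside a loop is to collect one dictionary per row and build the table once at the end. The sketch below illustrates the pattern only: `fake_results` and its values are made up and stand in for the per-variable results computed in the loop, and only three of the columns are shown.

```python
import pandas as pd

# Made-up results standing in for the values computed for each variable in the loop.
fake_results = [
    ("1", "density", 0, 99.8),
    ("1", "practical_salinity", 0, 99.7),
    ("3", "density", 0, 99.7),
]

rows, index = [], []
for deployment, sci_var, n_nan, percent_good in fake_results:
    rows.append({"sv": sci_var, "n_nan": n_nan, "valid (%)": percent_good})
    index.append(deployment)

# Build the table in one step, using the deployment number as the index.
summary = pd.DataFrame(rows, index=index)
print(summary)
```

Building the frame once avoids copying the accumulated table on every iteration, which may matter when many files and variables are processed.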


<a id="5"></a>
### Summary of Findings. 
The data from all deployments are of good quality:

- There are 5 science parameters in each deployment file.

- The parameter attributes are consistent across the 5 deployments examined.

- There are no fill values or extreme values in the parameter data arrays.

- Few data points are NaNs or fall outside the global ranges. 

- The majority of erroneous values are outliers: values outside the 3-standard-deviation envelope.

- The percent of valid data is high for all parameters, ranging from 97% to 100%.

- Data outliers need to be plotted and checked using a human-in-the-loop examination. This will be the focus of the next lab.

### END