## Data Validation Assignment.


<span style='color:blue' size=20 > **Instructions:** </span> 

- Rename this notebook to include your name.
- Use one of the NetCDF files to complete the Module 3 Lab assignment.
    - Each NetCDF file name starts with a deployment number.
    - In the example notebook I have used '**deployment0004**_GP03FLMB-R*.nc'.
    - Do not use Deployment 4 for your assignment.
    - Follow the outline below to reproduce the Sensor Location Review Process.

### Outline.

- [Extract the Deployment Number from the Data File.](#1)
- [Validate the Deployment Number Against the Data File Name.](#2)
- [Extract the Pressure Array from the Data File.](#3)
- [Check for Erroneous Data in the Pressure Array.](#4)
- [Calculate the Pressure Dataset's Basic Statistics.](#5)
- [Compare Pressure to the Deployment Depth.](#6)
- [Calculate the Lon/Lat Difference in km.](#7)
- [Summarize Results.](#8)
- [Report on the Sensor Location Review.](#9)

Python Packages

In [1]:
# import python packages
import xarray as xr
import pandas as pd
import numpy as np
from geopy.distance import geodesic



Load Data File

In [4]:
# use the path to your data file to change your directory
%cd H:\GEOS689\Lab3
# list the files of the current directory to get the name of the file you want to use.
%ls

# load the data
file_content = xr.open_dataset('deployment0003_GP03FLMB-RIM01-02-CTDMOG060-recovered_inst-ctdmo_ghqr_instrument_recovered_20150608T213001-20160703T183001.nc')

# print content
file_content

H:\GEOS689\Lab3
 Volume in drive H is Homes
 Volume Serial Number is 8000-0445

 Directory of H:\GEOS689\Lab3

06/25/2020  03:44 PM    <DIR>          .
06/24/2020  12:16 PM    <DIR>          ..
06/25/2020  03:44 PM    <DIR>          .ipynb_checkpoints
06/25/2020  03:44 PM           206,233 00003_DataEvaluation_TimeCoverage.ipynb
06/25/2020  03:35 PM            38,581 00003_LocationReviewAssignment_LoriBryan.ipynb
06/25/2020  02:02 PM            28,646 00003_TimeCoverageAssignment_LoriBryan-Copy1.ipynb
06/19/2020  04:22 PM             3,457 00003_TimeCoverageAssignment_YourName.ipynb
06/19/2020  04:21 PM         2,105,359 deployment0001_GP03FLMB-RIM01-02-CTDMOG060-recovered_inst-ctdmo_ghqr_instrument_recovered_20130724T064501-20140617T234501.nc
06/25/2020  03:41 PM         2,487,937 deployment0003_GP03FLMB-RIM01-02-CTDMOG060-recovered_inst-ctdmo_ghqr_instrument_recovered_20150608T213001-20160703T183001.nc
06/19/2020  04:21 PM         2,404,578 deployment0005_GP03FLMB-RIM01-02-CTDMOG060-

<xarray.Dataset>
Dimensions:                                  (obs: 37525)
Coordinates:
  * obs                                      (obs) int32 0 1 2 ... 37523 37524
Data variables:
    practical_salinity                       (obs) float64 ...
    ctd_time                                 (obs) datetime64[ns] ...
    density_qc_executed                      (obs) uint8 ...
    driver_timestamp                         (obs) datetime64[ns] ...
    id                                       (obs) |S36 ...
    conductivity                             (obs) float64 ...
    ctdmo_seawater_pressure_qc_executed      (obs) uint8 ...
    practical_salinity_qc_results            (obs) uint8 ...
    temperature                              (obs) float64 ...
    ctdmo_seawater_conductivity_qc_results   (obs) uint8 ...
    density                                  (obs) float64 ...
    ctdmo_seawater_conductivity_qc_executed  (obs) uint8 ...
    provenance                               (obs) |S36 ...


<a id="1"></a>
Extract the Deployment Number from the Data File.

In [5]:
# extract the deployment number
deployment_num = np.unique(file_content['deployment'])[0]
deployment_num

3

<a id="2"></a>
Validate the Deployment Number Against the Data File Name.

The deployment number should match the number in the file name. I chose the deployment0003 file, and its number matches the output of cell [5], which is 3.
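
The comparison can also be made in code rather than by eye. This is a minimal sketch, assuming the file name always begins with `deployment` followed by a zero-padded number and an underscore. The helper name `deployment_from_filename` is hypothetical and not part of the original notebook.

```python
import re

def deployment_from_filename(filename):
    """Return the deployment number encoded at the start of a file name."""
    match = re.match(r"deployment(\d+)_", filename)
    if match is None:
        raise ValueError(f"no deployment number found in {filename!r}")
    return int(match.group(1))

# File name loaded earlier in this notebook.
fname = ("deployment0003_GP03FLMB-RIM01-02-CTDMOG060-recovered_inst-"
         "ctdmo_ghqr_instrument_recovered_20150608T213001-20160703T183001.nc")

file_num = deployment_from_filename(fname)
deployment_num = 3  # value extracted from the 'deployment' variable
print("match" if file_num == int(deployment_num) else "mismatch")
```

Casting `deployment_num` to `int` keeps the comparison valid when the value is a NumPy integer, as it is when it comes from `np.unique`.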

Load Deployment File.

In [6]:
# use the path to the deployment file to change your directory. (the .csv file)
# Load Data
deployment_file = pd.read_csv('GP03FLMB-RIM01-02-CTDMOG060_cruise_info.csv')

# Print content
deployment_file

Unnamed: 0,Deployment,Cruise,Start Date,Stop Date,Mooring Asset,Node Asset,Sensor Asset,Latitude,Longitude,Deployment Depth,Water Depth
0,1,MV-1309,2013-07-24,2014-06-18,CGMGP-03FLMB-00001,,CGINS-CTDMOG-10255,50.3317,-144.401,30.0,4145
1,2,MV-1404,2014-06-20,2015-06-07,CGMGP-03FLMB-00002,,CGINS-CTDMOG-11646,50.3313,-144.398,31.0,4145
2,3,TN-323,2015-06-08,2016-07-03,CGMGP-03FLMB-00003,,CGINS-CTDMOG-12638,50.3303,-144.398,47.0,4145
3,4,RB-16-05,2016-07-04,2017-07-17,CGMGP-03FLMB-00004,,CGINS-CTDMOG-11638,50.3293,-144.398,,4146
4,5,SR17-10,2017-07-14,2018-07-25,CGMGP-03FLMB-00005,,CGINS-CTDMOG-13422,50.3777,-144.515,,4169
5,6,SR1811,2018-07-24,2019-09-27,CGMGP-03FLMB-00006,,CGINS-CTDMOG-10225,50.3295,-144.398,,4145
6,7,SKQ201920S,2019-09-27,,CGMGP-03FLMB-00007,,CGINS-CTDMOG-10218,50.3755,-144.514,,4176


In [8]:
# Extract the deployment information using the Deployment column above.
# Select the row that matches the deployment_num variable defined in a previous cell.
deployment_x = deployment_file[deployment_file['Deployment'] == deployment_num]

# Print row
deployment_x

Unnamed: 0,Deployment,Cruise,Start Date,Stop Date,Mooring Asset,Node Asset,Sensor Asset,Latitude,Longitude,Deployment Depth,Water Depth
2,3,TN-323,2015-06-08,2016-07-03,CGMGP-03FLMB-00003,,CGINS-CTDMOG-12638,50.3303,-144.398,47.0,4145


<a id="3"></a>
Extract the Pressure Array from the Data File.

In [9]:
# Determine which variable holds the pressure array.
# Do this by listing the variable names.
list_variables = file_content.variables.keys()

# Select the variables using the keyword 'pressure'.
pressure_name = [x for x in list_variables if 'pressure' in x]
print(pressure_name)

['ctdmo_seawater_pressure_qc_executed', 'ctdmo_seawater_pressure_qc_results', 'pressure_temp', 'pressure', 'ctdmo_seawater_pressure']


In [10]:
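# Determine which pressure variable has the science unit dbar.
# Do this by selecting the variable whose 'units' attribute is 'dbar'.
# Variables with a different unit print nothing; variables without a
# 'units' attribute raise KeyError and are reported as 'Fail'.
for x in pressure_name:
    try:
        x_unit = file_content[x].attrs['units']
        if x_unit == 'dbar':
            print('Pass:', x)
    except KeyError:
        print('Fail:', x)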

Fail: ctdmo_seawater_pressure_qc_executed
Fail: ctdmo_seawater_pressure_qc_results
Pass: ctdmo_seawater_pressure


In [11]:
# the answer is ctdmo_seawater_pressure

# Get the pressure array by using the name of the variable that passed the unit test above.
pressure = file_content['ctdmo_seawater_pressure']

# Get and print the pressure attribute names.
pressure.attrs.keys()

# _FillValue is missing from the attribute names listed below.
# I tried deleting the deployment file, re-downloading it, and starting over: same result.
# I also tried Deployment 0006: same issue.

odict_keys(['comment', 'long_name', 'coordinates', 'data_product_identifier', 'standard_name', 'units', 'ancillary_variables'])
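
One likely explanation for the missing attribute, assuming the file does define a fill value for this variable: `xr.open_dataset` decodes CF conventions by default. That decoding replaces fill values with NaN and moves `_FillValue` from `.attrs` to `.encoding`. The sketch below reproduces the behaviour on a small in-memory dataset using `xr.decode_cf`. The data values and the -9999 fill value are invented for illustration.

```python
import numpy as np
import xarray as xr

# Toy stand-in for the pressure variable; data and fill value are invented.
raw = xr.Dataset(
    {"ctdmo_seawater_pressure": (
        "obs",
        np.array([46.8, -9999.0, 47.1]),
        {"units": "dbar", "_FillValue": -9999.0},
    )}
)

# decode_cf applies the same CF decoding that open_dataset uses by default.
decoded = xr.decode_cf(raw)
p = decoded["ctdmo_seawater_pressure"]

print("_FillValue" in p.attrs)              # is it still listed in attrs?
fill_value = p.encoding.get("_FillValue")   # look in encoding instead
print(fill_value)
print(int(np.isnan(p.values).sum()))        # number of values masked as NaN
```

If this is what happened, the NaN check already covers the fill values, and `pressure.encoding.get('_FillValue')` can stand in for `pressure._FillValue`. Opening the file with `xr.open_dataset(path, mask_and_scale=False)` is another option: it keeps the raw values and leaves the attribute in `.attrs`.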

In [12]:
# Create a table to view the content.
# Put the pressure attributes in a DataFrame to look at the content.
df = pd.DataFrame({
                    'Long Name': [pressure.long_name],
                    'Standard Name': [pressure.standard_name],
                    'Comment': [pressure.comment],
                    'Coordinates': [pressure.coordinates],
                    'Units': [pressure.units],
#                   'Fill_values': [pressure._FillValue],
                    'Ancillary Variables': [pressure.ancillary_variables],
                    'Data Product Identifier': [pressure.data_product_identifier]
                    }, index=['Pressure'])

# Show full column contents (None replaces the deprecated -1 in pandas >= 1.0).
pd.set_option('display.max_colwidth', None)
df.T

Unnamed: 0,Pressure
Long Name,Seawater Pressure
Standard Name,sea_water_pressure
Comment,Seawater Pressure refers to the pressure exerted on a sensor in situ by the weight of the column of seawater above it. It is calculated by subtracting one standard atmosphere from the absolute pressure at the sensor to remove the weight of the atmosphere on top of the water column. The pressure at a sensor in situ provides a metric of the depth of that sensor.
Coordinates,time lat lon pressure
Units,dbar
Ancillary Variables,"pressure,pressure_temp"
Data Product Identifier,PRESWAT_L1


<a id="4"></a>
Check for Erroneous Data in the Pressure Array.

In [15]:
# Reject NaNs.
# Use function: ~np.isnan()
p_nonan = pressure.values[~np.isnan(pressure.values)]

# Calculate the number of data points that are NaNs.
len_nan = len(pressure) - len(p_nonan)

# Reject fill values. 
# Use operand: !=
# Use pressure._FillValue: returns the data fill value (-9999, see previous output).
##p_nonan_nofv = p_nonan[p_nonan != pressure._FillValue]

# Calculate the number of data points that are fill values.
##len_nan_fv = len(pressure) - len(p_nonan_nofv)

# Reject data outside global ranges.
# Use operands:( >= )  & (  <= )
# Use pressure global ranges: [0, 6000] dbar
##p_nonan_nofv_gr = p_nonan_nofv[(p_nonan_nofv >= 0) & (p_nonan_nofv <= 6000)]

# Calculate the number of data points that are outside [0,6000].

##len_nan_fv_gr = len(pressure) - len(p_nonan_nofv_gr)

# Reject extreme values.
# Use operands:( > )  & (  < )
# Use extreme values: [-1e7, 1e7]
##p_nonan_nofv_gr_ev = p_nonan_nofv_gr[(p_nonan_nofv_gr > -1e7) & (p_nonan_nofv_gr < 1e7)]

# Calculate the number of data points that are outside [-1e7, 1e7].
##len_nan_fv_gr_ev = len(pressure) - len(p_nonan_nofv_gr_ev)

# Reject outliers beyond 3 standard deviations of the mean.
# Use standard deviation function: np.nanstd
##stdev = np.nanstd(p_nonan_nofv_gr_ev)

# Use function to calculate the mean: np.nanmean()
##mean_pressure = np.nanmean(p_nonan_nofv_gr_ev)

# Use formula: abs(data - np.nanmean(data)) < 3 * stdev 
##p_nonan_nofv_gr_ev_std = p_nonan_nofv_gr_ev[abs(p_nonan_nofv_gr_ev - mean_pressure) < 3 * stdev]

# Calculate the number of data points that are outside 3 standard deviations of the mean.
##len_nan_fv_gr_ev_std = len(pressure) - len(p_nonan_nofv_gr_ev_std)

# Add a note to the report when the pressure array is not valid.
# Not valid:  all NaNs
#          or all fill values
#          or all values outside of global ranges
#          or all values are extreme values.

##notes = ['']
##if len(pressure) > 0 and len(p_nonan) == 0: # NaNs
##    notes.append('Pressure variable all NaNs')
##elif len(pressure) > 0 and len(p_nonan) > 0 and len(p_nonan_nofv) == 0: # fill values
##    notes.append('Pressure variable all fill values')
##elif len(pressure) > 0 and len(p_nonan) > 0 and len(p_nonan_nofv) > 0 and len(p_nonan_nofv_gr) == 0: # outside of global ranges
##    notes.append('Pressure variable outside of global ranges')
##elif len(pressure) > 0 and len(p_nonan) > 0 and len(p_nonan_nofv) > 0 and len(p_nonan_nofv_gr) > 0 and len(p_nonan_nofv_gr_ev) == 0: # extreme values
##    notes.append('Pressure variable all beyond (+/-)1e7')
   
##print(notes)
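
The rejection steps outlined in the comments above can be sketched as a single function. This is a minimal illustration, not the assignment's reference solution. `qc_pressure` and its parameter names are hypothetical, the sample values are invented, and the fill value is passed in explicitly because it may not be available as `pressure._FillValue`.

```python
import numpy as np

def qc_pressure(values, fill_value=None, global_range=(0.0, 6000.0),
                extreme=1e7, n_std=3.0):
    """Apply the rejection steps in order.

    Returns the cleaned data and the cumulative number of rejected
    points after each step.
    """
    values = np.asarray(values, dtype=float)
    total = len(values)
    rejected = {}

    # 1. NaNs
    data = values[~np.isnan(values)]
    rejected["nan"] = total - len(data)

    # 2. Fill values (skipped when no fill value is known)
    if fill_value is not None:
        data = data[data != fill_value]
    rejected["fill_value"] = total - len(data)

    # 3. Global range, in dbar
    lo, hi = global_range
    data = data[(data >= lo) & (data <= hi)]
    rejected["global_range"] = total - len(data)

    # 4. Extreme values
    data = data[(data > -extreme) & (data < extreme)]
    rejected["extreme"] = total - len(data)

    # 5. Outliers beyond n_std standard deviations of the mean.
    #    '<=' is used (not '<') so a constant series is not rejected.
    if len(data) > 0:
        mean, std = data.mean(), data.std()
        data = data[np.abs(data - mean) <= n_std * std]
    rejected["outlier"] = total - len(data)

    return data, rejected

# Synthetic sample: values are invented for illustration only.
sample = np.array([47.0] * 20 + [47.5, 46.5, np.nan, -9999.0, 7000.0, 300.0])
clean, counts = qc_pressure(sample, fill_value=-9999.0)
print(counts)
print(len(clean))
```

Each count is cumulative, matching the `len(pressure) - len(...)` pattern used in the cell above. Subtracting consecutive counts gives the number rejected by a single step.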

<a id="5"></a>
Calculate the Pressure Dataset's Basic Statistics.
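
A minimal sketch of this step, assuming the statistics of interest are the count, minimum, maximum, mean, median, and standard deviation of the cleaned pressure array. `basic_stats` is a hypothetical helper and the sample values are invented.

```python
import numpy as np

def basic_stats(data):
    """Return basic statistics of an array, ignoring NaNs."""
    data = np.asarray(data, dtype=float)
    data = data[~np.isnan(data)]
    if data.size == 0:
        return {"n": 0}
    return {
        "n": int(data.size),
        "min": float(data.min()),
        "max": float(data.max()),
        "mean": float(data.mean()),
        "median": float(np.median(data)),
        "std": float(data.std()),  # population std (ddof=0), like np.nanstd
    }

# Invented sample values, in dbar.
sample = np.array([46.0, 47.0, 48.0, np.nan])
stats = basic_stats(sample)
for name, value in stats.items():
    print(name, value)
```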

<a id="6"></a>
Compare Pressure to the Deployment Depth.
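
A minimal sketch of this comparison, assuming that pressure in dbar is roughly equal to depth in meters near the surface, which is an approximation. The 47 m deployment depth is the value listed for Deployment 3 in the cruise table. The sample pressures, the helper `depth_check`, and the 10 m tolerance are hypothetical.

```python
from statistics import median

def depth_check(pressures_dbar, deployment_depth_m, tolerance_m=10.0):
    """Compare the median pressure with the nominal deployment depth.

    Treats 1 dbar as about 1 m of seawater; tolerance_m is an
    arbitrary illustrative threshold, not an official limit.
    """
    p_median = median(pressures_dbar)
    diff = p_median - deployment_depth_m
    return {
        "median_pressure": p_median,
        "difference": diff,
        "within_tolerance": abs(diff) <= tolerance_m,
    }

# Invented sample pressures, in dbar.
sample = [46.2, 47.9, 48.4, 47.1, 46.8]
result = depth_check(sample, deployment_depth_m=47.0)
print(result["within_tolerance"])
```

The Deployment Depth column is empty for Deployments 4 through 7 in the cruise table, so a check like this only applies when a depth is present.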

<a id="7"></a>
Calculate the Lon/Lat Difference in km.
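
The notebook imports `geodesic` from geopy for this step. The sketch below swaps in the haversine formula as a dependency-free stand-in. It assumes a spherical Earth with a mean radius of 6371 km, so its results differ slightly from an ellipsoidal geodesic. The two positions are the Deployment 3 and Deployment 5 coordinates from the cruise table, used here only as sample inputs. In the assignment, one of the points would come from the data file.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Coordinates taken from the cruise table (Deployments 3 and 5).
dep3 = (50.3303, -144.398)
dep5 = (50.3777, -144.515)

distance = haversine_km(dep3[0], dep3[1], dep5[0], dep5[1])
print(round(distance, 2), "km")
```

With geopy installed, the equivalent call would be `geodesic(dep3, dep5).km`, which takes `(latitude, longitude)` tuples.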

<a id="8"></a>
Summarize Results.

<a id="9"></a>
Report on the Sensor Location Review.