# Explore the Data File Metadata.

The data file used in the Module 1 Lab assignment is a NetCDF file. The following is a brief description of the file type, its data structure, and why it is suitable for the metadata exercise.

#### File Type: 

- The file type used for this exercise is the Network Common Data Form (NetCDF), with the file extension "*.nc".
- NetCDF files are stored in a binary format.
- It is a self-documenting data storage and data access format that is used in the geoscience field.



 #### Data Structure:
 
- NetCDF is commonly used for storing and interchanging multidimensional scientific data variables (Schema 1).
- This file type includes data (observation values) and descriptive information about the data (Metadata). 
- Data are organized along dimensions such as time, date, latitude, longitude, or depth. 
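
To make the structure described above more concrete, here is a minimal sketch that models a NetCDF-like layout with plain Python dictionaries. It is not the NetCDF or xarray API; the names (`toy_file`, `temperature`) and all values are made up for illustration only.

```python
# Illustrative only: a NetCDF-like layout modeled with plain Python dicts.
toy_file = {
    # Dimensions: name -> size
    "dimensions": {"time": 3, "depth": 2},
    # Variables: each one carries its dimensions, its metadata, and its values
    "variables": {
        "temperature": {
            "dims": ("time", "depth"),
            "attrs": {"units": "degree_Celsius", "long_name": "Sea water temperature"},
            "values": [[20.1, 19.8], [20.3, 19.9], [20.2, 19.7]],
        },
    },
    # Global attributes: metadata describing the whole file
    "attrs": {"title": "Toy example"},
}

temperature = toy_file["variables"]["temperature"]

# The shape of the values must match the sizes of the dimensions the variable uses.
expected_shape = tuple(toy_file["dimensions"][d] for d in temperature["dims"])
actual_shape = (len(temperature["values"]), len(temperature["values"][0]))

print("expected shape:", expected_shape, "actual shape:", actual_shape)
print("units:", temperature["attrs"]["units"])
```

The point of the sketch is that the observation values and their descriptive information travel together in the same file.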


##### Why NetCDF File?

- The NetCDF files are easy to manipulate.
- Metadata is entered in a structured way, so it is easy to retrieve.
- Metadata entries can be written as free text and can be reasonably long.
- Other common file types, such as Excel (.xlsx), text (.txt), and comma-separated values (.csv), do not share all of these characteristics.

### Schema 1.
#### NetCDF Multidimensional Data Structure Example.
<img src='NetCDF_dim.png' />

### <span style='color:Green'> Outline   </span>

- In this exercise we are going to execute the following steps to learn how to extract metadata information from the NetCDF files:

    * [Step 0. Import Python Packages.](#0)
    
    * [Step 1. Access, Load, and Print the Data File Content.](#1)
    
        * [Define data file.](#2)
        
        * [Load data file.](#3)
        
        * [Print the length, the type, and the content of the data file.](#4)
        
        * [What is the data File Content.](#5)
        
    * [Step 2. Extract and Print File Elements Separately.](#6)
    
        * [Python syntax.](#7)
        
        * [Dimensions.](#8)
        
            * [Explore the parameters contained in the Dimensions content dictionary.](#9)
            
        * [Coordinates.](#10)
         
            * [Explore the parameters contained in the Coordinates content array.](#11)
             
        * [Data variables.](#12)
                    
            * [List the parameters in Data variables array.](#13)
             
            * [List the information associated with a parameter.](#14)
            
            * [Extract parameter attributes.](#41)
             
        * [Attributes.](#15)
         
            * [Explore the parameters contained in the Attributes content dictionary.](#16)
             
            * [Extract the information in the Attributes content key.](#18)
             
    * [Summary.](#19)
        

- [x] <span style='color:Orange' size=20 > **Attention:** </span> This lab's learning material is key and can be used as a reference for the rest of the course labs. 

<a id="0"></a>
### Step 0. Import Python Packages.
Import the xarray and pandas packages needed to run the code lines in this notebook.

**Note:**   Import a package before the first code line that uses it.

In [1]:
# xr and pd are the aliases used to call each package's functions in the code lines.
import xarray as xr
import pandas as pd

<a id="1"></a>

### Step 1. Access, Load, and Print the Data File Content.

<a id="2"></a>

#### Define data file.

Set a variable **filename** to your data file name.

- If the data file is visible in your dashboard, you only need to assign the file name to the variable filename:
    - filename = '(your data file name).nc'

- Check the directory you are in using this command:
    - %pwd

- If the data file is in a different directory, change directory to where your data file is stored:
    - %cd (path to your file)
    - filename = '(your data file name).nc'
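
As an alternative to changing directory, the steps above can be sketched with Python's standard `pathlib` module, which builds the full path and checks that the file exists before you try to load it. This is only a sketch: `data_dir` and `file_path` are names introduced here, and it assumes the file sits in the current working directory unless you change `data_dir`.

```python
from pathlib import Path

# Assumption: the data file sits in the current working directory.
# Replace Path.cwd() with the folder that actually holds your file.
data_dir = Path.cwd()
filename = "(your data file name).nc"
file_path = data_dir / filename

if file_path.exists():
    print("Found data file:", file_path)
else:
    print("Data file not found:", file_path)
    print("Current directory is:", Path.cwd())
```

The resulting `file_path` can be passed to `xr.open_dataset` in place of the bare file name.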

In [71]:
# If the data file is in your notebook directory you only need to enter the file name to the variable filename

filename = 'FGBNMS_FGBNMS-15-09_Stetson_Bank_Long_Term_Monitoring_1_bf82_615c_5b81.nc'

# If the data file is not in your notebook directory you need to change directory to where your data file is stored

# Check the directory you are in
%pwd


'/Users/leilabelabassi/Desktop/TAMU/online-class/612-DataQuality4theGeosciences/NetCDF-Files'

In [2]:
# change directory to where your data file is stored
%cd '/Users/leilabelabassi/Desktop/TAMU/online-class/612-DataQuality4theGeosciences/NetCDF-Files/'
filename = 'cp_339-20200302T1109.nc3.nc'

/Users/leilabelabassi/Desktop/TAMU/online-class/612-DataQuality4theGeosciences/NetCDF-Files


<a id="3"></a>
#### Load Data File. 
Use xarray (**xr**) to open the data file and load its content into the variable **file_content**.

In [3]:
# Load the data; decode_times=False keeps the time values as raw numbers
file_content = xr.open_dataset(filename, decode_times=False)  # or: decode_cf=False

<a id="4"></a>
#### Print the Length, the Type, and the Content of the Data File.
Python functions used are: **print, len, type**

In [4]:
# Print the length of file_content (the number of data variables) using the print and len functions

print('- The number of variables in the file:    ', len(file_content), '\n\n')

# Print the type of the file_content using the print and type functions

print('- The file variables are loaded in:    ', type(file_content), '\n\n')

# Print the content of the file

print('- File Content:    ', '\n\n', file_content, '\n\n')


- The number of variables in the file:     43 


- The file variables are loaded in:     <class 'xarray.core.dataset.Dataset'> 


- File Content:     

 <xarray.Dataset>
Dimensions:               (obs: 380, profile: 6906, trajectory: 1)
Coordinates:
  * trajectory            (trajectory) object 'cp_339-20200302T1109'
    time                  (trajectory, profile) float64 ...
    latitude              (trajectory, profile) float64 ...
    longitude             (trajectory, profile) float64 ...
    time_uv               (trajectory, profile) float64 ...
    lat_uv                (trajectory, profile) float64 ...
    lon_uv                (trajectory, profile) float64 ...
    depth                 (trajectory, profile, obs) float32 ...
Dimensions without coordinates: obs, profile
Data variables:
    wmo_id                (trajectory) object ...
    profile_id            (trajectory, profile) float64 ...
    u                     (trajectory, profile) float64 ...
    v                    





<a id="5"></a>
####   Data File Organization:
- **Elements:** variables are grouped under elements [Dimensions, Coordinates, Data variables, and Attributes].
    - **Variables:** parameters are grouped under variables [depth, temperature, speed, ...].
        - **Parameters:** data information is listed under parameters [values, units, precision, ...].

- [x] <span style='color:Blue'> _**Metadata**:_  </span> the metadata exists in any of the file elements described above. The rest of the notebook will show how to access the metadata information.

<a id="6"></a>
### Step 2. Extract and Print File Elements Separately.

<a id="7"></a>
#### Python Syntax.
The syntax to use is: 


|**syntax:** file_content. _attributes_ |file_content |. |_attributes_|
| - | - | - | - |
|**Description**| the variable previously defined |dot |attribute of file_content: replace _attributes_ with dims, coords, variables, or attrs |
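
The dot syntax in the table can be tried out on a small dataset built in memory. The sketch below assumes a hypothetical toy dataset named `toy` that stands in for file_content; its variable names and values are made up. It loops over the four element names and reads each one with `getattr`, which is equivalent to writing `toy.dims`, `toy.coords`, and so on.

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset standing in for file_content.
toy = xr.Dataset(
    data_vars={"roll": (("profile", "obs"), np.zeros((2, 3)))},
    coords={"time": ("profile", np.array([0.0, 60.0]))},
    attrs={"title": "toy dataset"},
)

# getattr(toy, "dims") is the same as writing toy.dims
element_lengths = {}
for element in ("dims", "coords", "variables", "attrs"):
    content = getattr(toy, element)
    element_lengths[element] = len(content)
    print(element, "->", type(content).__name__, "with", len(content), "entries")
```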



<a id="8"></a>
###  <span style='color:Purple'> Dimensions  </span>

In [5]:
# Print the Dimensions length using the print and len functions

print('- The number of parameters in Dimensions:    ', len(file_content.dims), '\n\n')

# Print the Dimensions type using the print and type functions

print('- Dimensions is loaded in:    ', type(file_content.dims), '\n\n')

# Print the Dimensions content

print('- Dimensions content:    ', file_content.dims, '\n\n')


- The number of parameters in Dimensions:     3 


- Dimensions is loaded in:     <class 'xarray.core.utils.Frozen'> 


- Dimensions content:     Frozen(SortedKeysDict({'trajectory': 1, 'profile': 6906, 'obs': 380})) 




<a id="9"></a>

##### Explore the parameters contained in the Dimensions content dictionary.
- Set a variable **dims_var** to one of the parameters stored in the Dimensions content.
- Get the parameter name from the output of the previous cell.
- Use the following syntax: **file_content.dims[dims_var]**

|syntax|description|
|-| -|
| file_content | the variable previously defined   |
| | |
|   .  | dot|
| | |
| dims| attributes of file_content |
| | |
| [dims_var] | variable name between brackets |

- If the Dimensions content is not empty print its content.

In [8]:
# set a variable dims_var
dims_var = 'trajectory'

# Print the content of the parameter dims_var stored in the Dimensions dictionary

if file_content.dims:
    print(dims_var, ':    ', file_content.dims[dims_var], '\n\n')

trajectory :     1 




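- [x] <span style='color:Blue'> Metadata:  </span> the dimension sizes tell us how many data points to expect for each parameter measured; in this file, a parameter defined on (trajectory, profile, obs) can hold up to 1 × 6906 × 380 values. 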
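
As a small worked example of reading the Dimensions metadata, the sketch below multiplies the dimension sizes printed earlier to get the largest number of values a parameter can hold. The sizes are copied by hand from the Dimensions output; `depth_dims` and `max_values` are names introduced here.

```python
import math

# Dimension sizes copied from the Dimensions output above.
dims = {"trajectory": 1, "profile": 6906, "obs": 380}

# depth is listed in the file content with dimensions (trajectory, profile, obs).
depth_dims = ("trajectory", "profile", "obs")
max_values = math.prod(dims[name] for name in depth_dims)

print("Maximum number of depth values:", max_values)
```

This is an upper bound: positions that hold no measurement are filled with nan, so the number of usable values can be smaller.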

<a id="10"></a>
### <span style='color:Purple'> Coordinates  </span>

In [9]:
# Print Coordinates length using the print and len functions

print('- The number of parameters in Coordinates:    ', len(file_content.coords), '\n\n')

# Print Coordinates type using the print and type functions

print('- Coordinates is loaded in:    ', type(file_content.coords), '\n\n')

# Print Coordinates content

print('- Coordinates content:    ', '\n\n', file_content.coords, '\n\n')

- The number of parameters in Coordinates:     8 


- Coordinates is loaded in:     <class 'xarray.core.coordinates.DatasetCoordinates'> 


- Coordinates content:     

 Coordinates:
  * trajectory  (trajectory) object 'cp_339-20200302T1109'
    time        (trajectory, profile) float64 ...
    latitude    (trajectory, profile) float64 ...
    longitude   (trajectory, profile) float64 ...
    time_uv     (trajectory, profile) float64 ...
    lat_uv      (trajectory, profile) float64 ...
    lon_uv      (trajectory, profile) float64 ...
    depth       (trajectory, profile, obs) float32 ... 




<a id="11"></a>

##### Explore the parameters contained in the Coordinates content array.
- Set a variable **coord_var** to one of the parameters stored in the Coordinates content.
- Get the parameter name from the output of the previous cell.
- Use the variable attributes to extract the parameter content: **values, attrs, dims, coords**
- The syntax to use is:

|syntax|description|
|-| -|
| file_content | the variable previously defined   |
| | |
|   .  | dot|
| | |
| coords| attributes of file_content |
| | |
| [coord_var] | variable name between brackets  |
| | |
|   .  | dot|
| | |
| values, attrs, dims, coords| attributes of coord_var |

- If the Coordinates array is not empty print its content.

In [10]:
# set a variable coord_var
coord_var = 'time'
# Print the parameter content stored in the Coordinates array
if file_content.coords:    
    print(str(coord_var),'values:    ', '\n\n', file_content.coords[coord_var].values, '\n\n')
    print(str(coord_var),'attributes:    ', '\n\n', file_content.coords[coord_var].attrs, '\n\n')
    print(str(coord_var),'dimensions:    ', '\n\n', file_content.coords[coord_var].dims, '\n\n')
    print(str(coord_var),'coordinates:    ', '\n\n', file_content.coords[coord_var].coords, '\n\n')

time values:     

 [[1.58314784e+09 1.58315105e+09 1.58315178e+09 ... 1.58960168e+09
  1.58960298e+09 1.58960430e+09]] 


time attributes:     

 {'_CoordinateAxisType': 'Time', 'actual_range': array([1.58314784e+09, 1.58960430e+09]), 'axis': 'T', 'calendar': 'gregorian', 'comment': 'Timestamp corresponding to the mid-point of the profile.', 'ioos_category': 'Time', 'long_name': 'Profile Time', 'observation_type': 'calculated', 'platform': 'platform', 'standard_name': 'time', 'time_origin': '01-JAN-1970 00:00:00', 'units': 'seconds since 1970-01-01T00:00:00Z', 'valid_max': nan, 'valid_min': 0.0} 


time dimensions:     

 ('trajectory', 'profile') 


time coordinates:     

 Coordinates:
  * trajectory  (trajectory) object 'cp_339-20200302T1109'
    time        (trajectory, profile) float64 1.583e+09 1.583e+09 ... 1.59e+09
    latitude    (trajectory, profile) float64 ...
    longitude   (trajectory, profile) float64 ...
    time_uv     (trajectory, profile) float64 ...
    lat_uv    

- [x] <span style='color:Blue'> Metadata:  </span> the coordinates are stored as observation values (time, latitude, longitude, depth) that locate each measurement.
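
The time attributes above carry the information needed to read the raw time values: the units are 'seconds since 1970-01-01T00:00:00Z'. The sketch below applies that definition with Python's standard `datetime` module. The two numbers are copied from the printed 'actual_range' attribute, so they are rounded; `first_date` and `last_date` are names introduced here.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the 'actual_range' attribute of time (as printed, so rounded).
first_time, last_time = 1.58314784e+09, 1.58960430e+09

# Reference date taken from the units attribute: 'seconds since 1970-01-01T00:00:00Z'.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

first_date = epoch + timedelta(seconds=first_time)
last_date = epoch + timedelta(seconds=last_time)

print("First profile:", first_date.isoformat())
print("Last profile: ", last_date.isoformat())
```

Without the units attribute, the numbers in the time values would be impossible to interpret, which is why this piece of metadata matters for data quality.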

<a id="12"></a>
### <span style='color:Purple'> Data Variables  </span>

In [20]:
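# Print Data variables length using the print and len functions
# Note: file_content.variables also includes the coordinate variables;
# file_content.data_vars lists the data variables only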

print('- The number of parameters in Data variables:    ', len(file_content.variables), '\n\n')

# Print Data variables type using the print and type functions

print('- Data variables is loaded in:    ', type(file_content.variables), '\n\n')

# Print Data variables content

print('- Data variables content:    ', '\n\n', file_content.variables, '\n\n')

- The number of parameters in Data variables:     51 


- Data variables is loaded in:     <class 'xarray.core.utils.Frozen'> 


- Data variables content:     

 Frozen({'trajectory': <xarray.IndexVariable 'trajectory' (trajectory: 1)>
array(['cp_339-20200302T1109'], dtype=object)
Attributes:
    _ChunkSizes:    20
    cf_role:        trajectory_id
    comment:        A trajectory is one deployment of a glider.
    ioos_category:  Identifier
    long_name:      Trajectory Name, 'wmo_id': <xarray.Variable (trajectory: 1)>
array(['4801957'], dtype=object)
Attributes:
    ioos_category:  Identifier
    long_name:      WMO ID, 'profile_id': <xarray.Variable (trajectory: 1, profile: 6906)>
array([[1.000e+00, 2.000e+00, 3.000e+00, ..., 6.904e+03, 6.905e+03, 6.906e+03]])
Attributes:
    actual_range:         [   1 6906]
    ancillary_variables:  profile_time
    cf_role:              profile_id
    comment:              Sequential profile number within the trajectory. Th...
    ioos_category:

<a id="13"></a>
##### List the parameters in Data variables array.
- To get the dictionary keys add .keys() to file_content.variables
- If the Data variables array is not empty print its content.

In [19]:
if file_content.variables:
    print('- List of parameter names:    ',
          '\n\n', list(file_content.variables.keys()), '\n\n')

- List of parameter names:     

 ['trajectory', 'wmo_id', 'profile_id', 'time', 'latitude', 'longitude', 'time_uv', 'lat_uv', 'lon_uv', 'u', 'v', 'precise_time', 'depth', 'pressure', 'temperature', 'conductivity', 'salinity', 'density', 'precise_lat', 'precise_lon', 'platform_meta', 'instrument_ctd', 'precise_lon_qc', 'conductivity_qc', 'temperature_qc', 'precise_time_qc', 'lat_uv_qc', 'density_qc', 'longitude_qc', 'lon_uv_qc', 'time_uv_qc', 'latitude_qc', 'u_qc', 'v_qc', 'depth_qc', 'time_qc', 'pressure_qc', 'precise_lat_qc', 'salinity_qc', 'radiation_wavelength', 'backscatter', 'instrument_flbbcd', 'dissolved_oxygen', 'instrument_oxygen', 'pitch', 'roll', 'PAR', 'instrument_par', 'CDOM', 'chlorophyll', 'oxygen_saturation'] 
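
The list above contains 51 names, while the file content printed earlier reports 43 variables and 8 coordinates. A likely reading is that `variables` holds both the measured parameters and the coordinates (43 + 8 = 51). The sketch below illustrates that relationship on a hypothetical toy dataset named `toy`, built in memory with made-up values.

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset: one measured parameter and one coordinate.
toy = xr.Dataset(
    data_vars={"roll": (("profile", "obs"), np.zeros((2, 3)))},
    coords={"time": ("profile", np.array([0.0, 60.0]))},
)

all_names = set(toy.variables.keys())    # measured parameters and coordinates
data_names = set(toy.data_vars.keys())   # measured parameters only
coord_names = set(toy.coords.keys())     # coordinates only

print("variables:", sorted(all_names))
print("data_vars:", sorted(data_names))
print("coords:   ", sorted(coord_names))
```

If you only want the measured parameters of your own file, `file_content.data_vars` is the element to list.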




<a id="14"></a>
##### List the information associated with the parameter.

- Set a variable **data_var** to one of the parameters stored in the Data variables array.
- Get the parameter name from the output of the previous cell.
- Use the same syntax used for the Coordinates parameters.
- If the Data variables array is not empty print its content.

In [25]:
# Set variable data_var
data_var = 'roll'
# Check that the Data variables content is not empty
if file_content.variables:
    print(str(data_var),'values:    ', '\n\n', file_content.variables[data_var].values, '\n\n')
    print(str(data_var),'attributes:    ', '\n\n', file_content.variables[data_var].attrs, '\n\n')
    print(str(data_var),'dimensions:    ', '\n\n', file_content.variables[data_var].dims, '\n\n')
    print(str(data_var),'coordinates:    ', '\n\n', file_content.variables[data_var].coords, '\n\n')

roll values:     

 [[[-6.7999777  -6.7999777  -6.7999777  ...         nan         nan
           nan]
  [-6.09999517 -6.09999517 -6.09999517 ...         nan         nan
           nan]
  [-6.09999517 -6.09999517 -6.09999517 ...         nan         nan
           nan]
  ...
  [-5.79999446 -5.79999446 -5.79999446 ...         nan         nan
           nan]
  [-5.40000117 -5.40000117 -5.40000117 ...         nan         nan
           nan]
  [-5.30000284 -5.30000284 -5.30000284 ...         nan         nan
           nan]]] 


roll attributes:     

 {'_ChunkSizes': 107, 'actual_range': array([-27.00000584,   4.59999739]), 'bytes': 4, 'comments': 'm_roll converted to degrees and forward filled', 'ioos_category': 'Other', 'long_name': 'Glider Vehicle Roll Angle', 'observation_type': 'measured', 'platform': 'platform', 'source_sensor': 'm_roll', 'standard_name': 'platform_roll_angle', 'units': 'degrees', 'valid_max': 90.0, 'valid_min': -90.0} 


roll dimensions:     

 ('trajectory', 'profil

AttributeError: 'Variable' object has no attribute 'coords'

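**Attention:**

- The error above is raised because **file_content.variables[data_var]** returns an xarray Variable object, which has no **coords** attribute. It does not mean that the parameter 'roll' has no coordinates.
- To list the coordinates of a parameter, index the file content directly with **file_content[data_var]**; this returns a DataArray, which does have **coords**.
- In a NetCDF file, the coordinates are usually added as an attribute named 'coordinates' to the parameter measured. This attribute may not appear in the attributes output when xarray has already used it to build the Coordinates element.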
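
The AttributeError shown above comes from the type of object returned, not from the data itself. The sketch below reproduces the difference on a hypothetical toy dataset named `toy` (made-up values): reading a parameter through `variables` gives an xarray Variable, while indexing the dataset directly gives a DataArray that carries its coordinates.

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset standing in for file_content.
toy = xr.Dataset(
    data_vars={"roll": (("profile", "obs"), np.zeros((2, 3)))},
    coords={"time": ("profile", np.array([0.0, 60.0]))},
)

as_variable = toy.variables["roll"]   # xarray Variable: values, attrs, and dims
as_dataarray = toy["roll"]            # xarray DataArray: also carries coords

print(type(as_variable).__name__, "has coords:", hasattr(as_variable, "coords"))
print(type(as_dataarray).__name__, "has coords:", hasattr(as_dataarray, "coords"))
print("roll coordinates:", list(as_dataarray.coords))
```

Applied to this notebook, `file_content[data_var].coords` should print the coordinates where `file_content.variables[data_var].coords` fails.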

<a id="41"></a>
##### Extract the parameter attributes.
- **Note:** The information is stored in a dictionary.
- Get the attribute name from the **data_var** attributes listed in the output of the previous cell.
- Set the variable **data_var_attrs** to one of the attribute names.
- Use the syntax below:

|syntax|description|
|-| -|
| file_content | the variable previously defined   |
| | |
|   .  | dot|
| | |
| variables| attributes of file_content |
| | |
| [data_var] | variable name between brackets |
| | |
|   .  | dot|
| | |
| attrs| attributes of data_var |
| | |
| [data_var_attrs] | variable name between brackets |

- If the **data_var** attributes dictionary is not empty, print its content.

In [26]:
data_var_attrs = 'units'
if file_content.variables[data_var].attrs:    
    print(data_var, data_var_attrs,':    ', '\n\n', \
          file_content.variables[data_var].attrs[data_var_attrs],\
         type(file_content.variables[data_var].attrs[data_var_attrs]), '\n\n')

    

roll units :     

 degrees <class 'str'> 




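- [x] <span style='color:Blue'> Metadata:  </span> The units of the parameter 'roll' are stored as a string ('degrees'). When a parameter has a coordinates attribute, it is a string naming the parameters to use as coordinates. Note that more metadata is available in the other parameter attributes, such as long_name, valid_min, valid_max, and, where present, precision. 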
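
Parameter attributes can also support a first data quality check. The sketch below compares the 'actual_range' of roll with its 'valid_min' and 'valid_max'. The values are copied by hand from the roll attributes output; `roll_attrs` and `within_valid_range` are names introduced here, and the check is an illustration rather than a prescribed course method.

```python
# Attribute values copied from the roll attributes output above.
roll_attrs = {
    "actual_range": [-27.00000584, 4.59999739],
    "units": "degrees",
    "valid_min": -90.0,
    "valid_max": 90.0,
}

lowest, highest = roll_attrs["actual_range"]

# The observed range should sit inside the range declared as valid.
within_valid_range = (
    roll_attrs["valid_min"] <= lowest and highest <= roll_attrs["valid_max"]
)

print("roll range:", lowest, "to", highest, roll_attrs["units"])
print("within valid range:", within_valid_range)
```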

<a id="15"></a>
### <span style='color:Purple'> Attributes  </span>

In [89]:
# Print the length of the Attributes element using the print and len functions

print('- The number of parameters in Attributes:    ', len(file_content.attrs), '\n\n')

# Print the type of the Attributes element using the print and type functions

print('- Attributes is loaded in:    ', type(file_content.attrs), '\n\n')

# Print the Attributes content

print('- Attributes content:    ', '\n\n', file_content.attrs, '\n\n')

- The number of parameters in Attributes:     70 


- Attributes is loaded in:     <class 'dict'> 


- Attributes content:     

 {'node': 'RIM01', 'comment': '', 'publisher_email': '', 'sourceUrl': 'http://oceanobservatories.org/', 'collection_method': 'telemetered', 'stream': 'ctdmo_ghqr_sio_mule_instrument', 'featureType': 'point', 'creator_email': '', 'publisher_name': 'Ocean Observatories Initiative', 'date_modified': '2019-10-17T03:16:56.980560', 'keywords': '', 'cdm_data_type': 'Point', 'references': 'More information can be found at http://oceanobservatories.org/', 'Metadata_Conventions': 'Unidata Dataset Discovery v1.0', 'date_created': '2019-10-17T03:16:56.980553', 'id': 'GP03FLMB-RIM01-02-CTDMOG060-telemetered-ctdmo_ghqr_sio_mule_instrument', 'requestUUID': 'be8cec74-60c4-4df9-b09b-07063ebf642d', 'contributor_role': '', 'summary': 'Dataset Generated by Stream Engine from Ocean Observatories Initiative', 'keywords_vocabulary': '', 'institution': 'Ocean Observatories Initiativ

<a id="16"></a>
##### Explore the parameters contained in the Attributes content dictionary.
- **Note:** The parameter names are in the Attributes content dictionary keys.
- To get the dictionary keys add .keys() to file_content.attrs
- If the Attributes content is not empty print its content.

In [91]:
# Print the list of parameters found in the Attributes
if file_content.attrs:
    print('Attributes Dictionary Keys:    ', '\n\n', file_content.attrs.keys(), '\n\n')

Attributes Dictionary Keys:     

 dict_keys(['node', 'comment', 'publisher_email', 'sourceUrl', 'collection_method', 'stream', 'featureType', 'creator_email', 'publisher_name', 'date_modified', 'keywords', 'cdm_data_type', 'references', 'Metadata_Conventions', 'date_created', 'id', 'requestUUID', 'contributor_role', 'summary', 'keywords_vocabulary', 'institution', 'naming_authority', 'feature_Type', 'infoUrl', 'license', 'contributor_name', 'uuid', 'creator_name', 'title', 'sensor', 'standard_name_vocabulary', 'acknowledgement', 'Conventions', 'project', 'source', 'publisher_url', 'creator_url', 'nodc_template_version', 'subsite', 'processing_level', 'history', 'Manufacturer', 'ModelNumber', 'SerialNumber', 'Description', 'FirmwareVersion', 'SoftwareVersion', 'AssetUniqueID', 'Notes', 'Owner', 'RemoteResources', 'ShelfLifeExpirationDate', 'Mobile', 'AssetManagementRecordLastModified', 'time_coverage_start', 'time_coverage_end', 'time_coverage_resolution', 'geospatial_lat_min', 'geospa

<a id="18"></a>
##### Extract the information in the Attributes content key.
- Set a variable **attrs_var** to one of the parameters stored in the Attributes dictionary keys.
- If the Attributes content is not empty print its content.

In [100]:
# set a variable attrs_var
attrs_var = 'Description' 
if file_content.attrs:
    print(attrs_var,':    ', file_content.attrs[attrs_var])

Description :     CTD Mooring (Inductive): CTDMO Series G


- [x] <span style='color:Blue'> Metadata:  </span> The sensor used to collect data is a "CTD Mooring (Inductive): CTDMO Series G". More information on the data may be extracted by setting attrs_var to other attribute parameters.
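
Because the Attributes element is loaded as a Python dict, a key that is missing raises a KeyError, and several keys in the output above hold empty strings. The sketch below shows one hedged way to handle both cases with the dict `get` method. The helper `describe` and the key 'not_a_real_key' are hypothetical; the three sample entries are copied from the Attributes output.

```python
# A few entries copied from the Attributes output above.
attrs = {
    "Description": "CTD Mooring (Inductive): CTDMO Series G",
    "publisher_name": "Ocean Observatories Initiative",
    "creator_email": "",
}


def describe(attrs, key):
    """Return the attribute value, or a note when it is missing or empty."""
    value = attrs.get(key)  # returns None instead of raising KeyError
    if value is None:
        return "(attribute not present)"
    if value == "":
        return "(attribute present but empty)"
    return value


for key in ("Description", "creator_email", "not_a_real_key"):
    print(key, ":    ", describe(attrs, key))
```

In the notebook, the same helper could be called with `file_content.attrs` in place of the sample dictionary.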

<a id="19"></a>
<span style='color:Green'>  Metadata Summary:  </span>

- In the file examined, the metadata exist in any of the file elements described above: **_Dimensions, Attributes, Data variables, Coordinates_**.

- The code lines used in this notebook are what you need to extract the metadata information.

The metadata information can be grouped in the following categories:
- **Category 1:** The data file information.
    - 'date_created', 'date_modified', 'ShelfLifeExpirationDate',
    - 'AssetManagementRecordLastModified', 'FirmwareVersion', 'SoftwareVersion'
    - 'keywords', 'keywords_vocabulary',
    - 'acknowledgement', 'license', 'history', 'Notes', 'comment',
    - 'title', 'source', 'summary', 'id', 'requestUUID', 'uuid',
    
    
- **Category 2:** The project information.
    - 'Owner', 'institution', 'references', 'project', 
    - 'contributor_name', 'contributor_role',
    - 'creator_name','creator_email', 'creator_url',
    - 'publisher_name', 'publisher_email', 'publisher_url', 
    - 'infoUrl','sourceUrl', 'naming_authority',
   
   
- **Category 3:** The data collection information.
    - 'subsite', 'node', 'sensor','stream', 'Mobile', 'collection_method'
    - 'Manufacturer', 'ModelNumber', 'SerialNumber', 'Description', 'AssetUniqueID'
    - 'processing_level', 'cdm_data_type', 'featureType' 
    - 'time_coverage_start', 'time_coverage_end', 'time_coverage_resolution',
    - 'lat', 'lon',
    - 'geospatial_lat_min', 'geospatial_lat_max', 'geospatial_lat_units', 'geospatial_lat_resolution', 
    - 'geospatial_lon_min', 'geospatial_lon_max', 'geospatial_lon_units', 'geospatial_lon_resolution', 
    - 'geospatial_vertical_units', 'geospatial_vertical_resolution', 'geospatial_vertical_positive'

    
- **Category 4:** The conventions used to create the metadata or data.
    - 'Metadata_Conventions', 'nodc_template_version', 'Conventions', 'standard_name_vocabulary'
  
  
- **Category 5:** Information about the parameter measured.
    - 'comment'
    - 'long_name' 
    - 'standard_name'
    - 'precision'
    - 'data_product_identifier'
    - 'units'
    - 'values'
    - 'dimension'
    - 'coordinates'
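
The categories above can be turned into a simple lookup that reports which metadata entries are actually filled in. The sketch below is one possible way to do it, not a required method: the category names are shortened, only a few keys per category are listed, and the sample `attrs` entries are copied from the Attributes output. In the notebook, `file_content.attrs` would take the place of `attrs`.

```python
# Category names follow the summary above; only a few keys per category are shown.
categories = {
    "data file": ["title", "summary", "date_created", "date_modified"],
    "project": ["institution", "project", "publisher_name", "creator_email"],
    "data collection": ["sensor", "Description", "collection_method"],
    "conventions": ["Conventions", "Metadata_Conventions"],
}

# A few entries copied from the Attributes output above.
attrs = {
    "date_created": "2019-10-17T03:16:56.980553",
    "date_modified": "2019-10-17T03:16:56.980560",
    "publisher_name": "Ocean Observatories Initiative",
    "creator_email": "",
    "collection_method": "telemetered",
    "Metadata_Conventions": "Unidata Dataset Discovery v1.0",
}

report = {}
for category, keys in categories.items():
    # Keep only the keys that are present and not empty.
    report[category] = {k: attrs[k] for k in keys if attrs.get(k)}
    print(category, "->", report[category])
```

Empty entries, such as 'creator_email' here, are dropped from the report, which makes gaps in the metadata easy to spot.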
    

<span style='color:Green'> What is Next:  </span>
- The information in the file is what you need to start evaluating the quality of your data. 
- Data quality evaluation will be covered in more detail in the upcoming course labs.

## END