# Explore the Data File Metadata.

The data file used in the Module 1 Lab assignment is a NetCDF file. The following is a brief description of the file type, its data structure, and why it suits the metadata exercise.

#### File Type: 

- The file type used for this exercise is the Network Common Data Form file (NetCDF) with file extension "*.nc".
- NetCDF is based on a binary file storage mechanism.
- It is a self-documenting data storage and data access method that is used in the geoscience field.



 #### Data Structure:
 
- NetCDF is commonly used for storing and interchanging multidimensional scientific data variables (Schema 1).
- This file type includes data (observation values) and descriptive information about the data (Metadata). 
- Data are organized along dimensions such as time, date, latitude, longitude, or depth. 


##### Why NetCDF File?

- The NetCDF files are easy to manipulate.
- Metadata is entered in a structured way, so it is easy to retrieve.
- Metadata entries can use a free text format and a good number of words.
- Other common file types, such as Excel (.xlsx), text (.txt), and comma-separated (.csv) files, do not share all of these characteristics.

### Schema 1.
#### NetCDF Multidimensional Data Structure Example.

<img src= 'NetCDF_dim.png' />

### <span style='color:Green'> Outline   </span>

- In this exercise we are going to execute the following steps to learn how to work with the NetCDF files and extract metadata information:

    * [Step 0. Import Python Packages.](#0)
    
    * [Step 1. Access, Load, and Print the Data File Content.](#1)
    
        * [Define Data File.](#2)
        
        * [Load Data File.](#3)
        
        * [Print Data File.](#4)
        
        * [Data File Content.](#5)
        
    * [Step 2. Extract File Elements Separately.](#6)
    
        * [Python Syntax.](#7)
        
        * [Dimensions](#8)
        
            * [Explore Dimensions Variables.](#9)
            
        * [Coordinates.](#10)
         
            * [Explore Coordinates Variables.](#11)
             
        * [Data Variables.](#12)
         
            * [Explore Data Variables.](#13)
            
            * [Explore Data Variables Attributes.](#31)
             
            * [Extract the Attributes Content.](#14)
             
        * [Attributes.](#15)
         
            * [Explore Attributes Names.](#16)
             
            * [Extract Attributes Information.](#17)
             
    * [Metadata Summary.](#18)
        

<span style='color:Orange' size=20 > **Attention:** </span> 
- This lab learning material is a reference for the rest of the course labs. 
- To run the notebook, you need to follow the steps in order.
- Do not forget to run your cell before you move on to the next one. 
- This is especially true for the code cells: the output of a cell may be an input in the next cell.

<a id="0"></a>
## <span font=Cooper size=20 > Step 0. Import Python Packages. </span>
To run the code in this notebook you need the xarray and pandas Python packages.
Use the import statement to make the packages available in the notebook.

<span style='color:Purple' size=20 > **Note:** </span> 
- If you need to add more packages you can do so in a code cell. 

- A package must be imported before any of its functions are called in a code cell.
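
If an import fails, the package is probably not installed. The following is a minimal sketch, using only the standard library, for checking that the two packages can be found before importing them. The helper name `missing_packages` is hypothetical and not part of the course material.

```python
import importlib.util


def missing_packages(names):
    """Return the names in `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# 'xarray' and 'pandas' are the packages this notebook needs.
missing = missing_packages(["xarray", "pandas"])
if missing:
    print("Install these packages before continuing:", missing)
else:
    print("All required packages are available.")
```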


In [3]:
# xr and pd are the aliases used to call functions from each package.
import xarray as xr
import pandas as pd

<a id="1"></a>
## <span font=Cooper size=20 >  Step 1. Access, Load, and Print the Data File Content.  </span>


<a id="2"></a>
### Define Data File.

- Set **filename** as a variable to define the data file:

    - If the data file is visible in your dashboard you only need to set the variable filename to the name of the data file:
        - filename = 'your data file name'   (**Note**: Your data file name should end with .nc)
                              
    - Check the directory you are in using this command:
        - %pwd
                  
    - If the data file is in a different directory, change directory to where your data file is stored using the following command:
        - %cd (path to your file)
        - **filename** = 'your data file name'    (**Note**: Your data file name should end with .nc)
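
The same checks can be done in plain Python. The sketch below uses `pathlib` from the standard library to confirm that a file name ends with .nc and that the file exists in a given directory. The function name `check_data_file` is hypothetical; it is one possible way to catch a wrong path early, not a required step of the lab.

```python
from pathlib import Path


def check_data_file(filename, directory="."):
    """Return the path to the data file, or raise an error if it looks wrong."""
    path = Path(directory) / filename
    if path.suffix != ".nc":
        raise ValueError(f"Expected a NetCDF file ending in .nc, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"{path.name} not found in {path.parent.resolve()}")
    return path


# Path.cwd() reports the same information as the %pwd command.
print("Current directory:", Path.cwd())
```

Calling `check_data_file(filename)` before loading the file gives a clear message when the notebook is running in the wrong directory.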

In [5]:
# The data file is in your home dashboard.

filename = 'FGBNMS_FGBNMS-15-09_Stetson_Bank_Long_Term_Monitoring_1_bf82_615c_5b81.nc'

In [7]:
# The data file is not in your home dashboard. 
# Change directory to where your data file is stored.
# This is how to check the directory you are in:
%pwd

'/Users/leilabelabassi/Desktop/TAMU/online-class/612-DataQuality4theGeosciences/NetCDF-Files'

In [1]:
# This is how you change your directory to where your data file is stored:
%cd '/Users/leilabelabassi/Downloads/'

filename = 'deployment0002_GI03FLMB-RIM01-02-CTDMOG060-recovered_inst-ctdmo_ghqr_instrument_recovered_20150821T171501-20160717T234501.nc'

/Users/leilabelabassi/Downloads


<a id="3"></a>
### Load Data File. 
- Use xarray (**xr**) to open and read the data file content.
- Set **file_content** as a variable to record the data file content.
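
The cell below opens the file with `decode_times=False`, which leaves time variables as raw numbers. As a rough illustration of what that option means, the sketch below builds a tiny in-memory dataset and decodes it afterwards with `xr.decode_cf`. The variable name, the values, and the `units` string are assumptions made for this example; they are not read from the course data file.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for a file opened with decode_times=False:
# 'time' holds raw numbers plus a CF-style 'units' attribute.
raw = xr.Dataset(
    {
        "time": (
            "obs",
            np.array([0.0, 3600.0, 7200.0]),
            {"units": "seconds since 2015-08-21"},
        )
    }
)
print(raw["time"].values)      # plain floating-point numbers

# xr.decode_cf applies the CF conventions, turning the numbers into dates.
decoded = xr.decode_cf(raw)
print(decoded["time"].values)  # datetime64 timestamps
```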

In [4]:
# Load data
# decode_times=False keeps time variables as raw numbers instead of dates.
file_content = xr.open_dataset(filename, decode_times=False)  # alternative: decode_cf=False

<a id="4"></a>
### Print Data File.
- Use Python functions **print, len, type** to get more information about the content of the data file.

In [5]:
# Print the length of the file_content using the print and len functions

print('- The number of variables in the file:    ', len(file_content), '\n\n')

# Print the type of the file_content using the print and type functions

print('- The file variables are loaded in:    ', type(file_content), '\n\n')

# Print the content of the file

print('- File Content:    ', '\n\n', file_content, '\n\n')


- The number of variables in the file:     31 


- The file variables are loaded in:     <class 'xarray.core.dataset.Dataset'> 


- File Content:     

 <xarray.Dataset>
Dimensions:                                  (obs: 31803)
Coordinates:
  * obs                                      (obs) int32 0 1 2 ... 31801 31802
Data variables:
    practical_salinity                       (obs) float64 ...
    ctd_time                                 (obs) float64 ...
    density_qc_executed                      (obs) uint8 ...
    driver_timestamp                         (obs) float64 ...
    id                                       (obs) |S36 ...
    conductivity                             (obs) float64 ...
    ctdmo_seawater_pressure_qc_executed      (obs) uint8 ...
    practical_salinity_qc_results            (obs) uint8 ...
    temperature                              (obs) float64 ...
    ctdmo_seawater_conductivity_qc_results   (obs) uint8 ...
    density                                  

<a id="5"></a>
### Data File Content.

**Data File Elements:**
NetCDF files are organized as follows:
- <span style='color:Purple'> Dimensions: </span>
   - Variables:  e.g., obs 
        - Parameters. e.g., values
- <span style='color:Purple'> Coordinates: </span>
   - Variables:  e.g., obs
        - Parameters.  e.g., values
- <span style='color:Purple'> Data variables: </span>
   - Variables:  e.g., time
        - Parameters. e.g., values, unit, long_name
- <span style='color:Purple'> Attributes: </span>
   - Variables: e.g., node
        - Parameters. e.g., information
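
To make the four elements concrete, here is a minimal sketch that builds a miniature dataset in memory and prints each element. The dataset `demo` and its values are hypothetical and only mimic the layout of the lab file.

```python
import numpy as np
import xarray as xr

# Hypothetical miniature dataset; names and values are illustrative only.
demo = xr.Dataset(
    data_vars={
        "temperature": (
            "obs",
            np.array([10.5, 10.7, 10.6]),
            {"units": "degree_Celsius", "long_name": "Seawater Temperature"},
        ),
    },
    coords={"obs": np.arange(3)},
    attrs={"node": "RIM01"},
)

print(dict(demo.sizes))             # Dimensions: {'obs': 3}
print(list(demo.coords))            # Coordinates: ['obs']
print(list(demo.data_vars))         # Data variables: ['temperature']
print(demo.attrs)                   # Attributes: {'node': 'RIM01'}
print(demo["temperature"].attrs)    # metadata attached to one variable
```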
        
 

______________________________________________
###  _Metadata:_  
- [x] <span style='color:blue'>   Metadata exists in any of the elements described above. The rest of the notebook will show how to access the metadata information. </span>
______________________________________________

<a id="6"></a>
## Step 2. Extract File Elements Separately.


<a id="7"></a>
### Python Syntax. 
file_content. _attributes_ 


| | | | |
| - | - | - | - |
|<span style='color:blue'> **syntax:** </span>| file_content | . | _attributes_ |
|**Description**|defined variable|dot|replace with one of the attributes of file_content: <span style='color:Purple'> coords, variables, dims, attrs </span>|



<a id="8"></a>
###  <span style='color:Purple'> Dimensions.  </span>

In [6]:
print('- The number of variables in Dimensions:    ', len(file_content.dims), '\n\n')

print('- Dimensions is loaded in:    ', type(file_content.dims), '\n\n')

print('- Dimensions content:    ', file_content.dims, '\n\n')


- The number of variables in Dimensions:     1 


- Dimensions is loaded in:     <class 'xarray.core.utils.Frozen'> 


- Dimensions content:     Frozen(SortedKeysDict({'obs': 31803})) 




<a id="9"></a>
##### Explore Dimensions variables.
- Set **dims_var** to one of the variables stored in Dimensions.
- Get the variable name from the output of the previous cell.
- Use the following syntax: **file_content.dims[dims_var]**

|syntax|description|
|-| -|
| file_content | the variable previously defined   |
| | |
|   .  | dot|
| | |
| dims| attributes of file_content |
| | |
| [dims_var] | variable name between brackets |

- If the Dimensions content is not empty print its content.
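
Recent xarray releases point users to `sizes` for the mapping from dimension name to length, and may show a warning when `dims[...]` is used on a dataset. The sketch below shows the `sizes` form on a hypothetical in-memory dataset; with the lab file, the equivalent expression would be `file_content.sizes[dims_var]`.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with a single dimension named 'obs'.
demo = xr.Dataset(coords={"obs": np.arange(5)})

dims_var = "obs"
# sizes maps each dimension name to its length.
print(dims_var, ":    ", demo.sizes[dims_var])
```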

In [7]:
# define dims_var
dims_var = 'obs'

# Print the content of dims_var

if file_content.dims:
    print(dims_var, ':    ', file_content.dims[dims_var], '\n\n')

obs :     31803 




________________________________________________
###  _Metadata:_ 
- [x] <span style='color:Blue'> We expect to have 31803 data points for each parameter measured, one per observation. </span> 
________________________________________________

<a id="10"></a>
### <span style='color:Purple'> Coordinates. </span>

In [8]:
print('- The number of parameters in Coordinates:    ', len(file_content.coords), '\n\n')

print('- Coordinates is loaded in:    ', type(file_content.coords), '\n\n')

print('- Coordinates content:    ', '\n\n', file_content.coords, '\n\n')

- The number of parameters in Coordinates:     1 


- Coordinates is loaded in:     <class 'xarray.core.coordinates.DatasetCoordinates'> 


- Coordinates content:     

 Coordinates:
  * obs      (obs) int32 0 1 2 3 4 5 6 ... 31797 31798 31799 31800 31801 31802 




<a id="11"></a>
##### Explore Coordinates variables.
- Set **coord_var** to one of the variables stored in Coordinates.
- Get the variable name from the output of the previous cell.
- Use the variable attributes to extract the parameters' content: **values, attrs, dims, coords**
- The syntax to use is:

|syntax|description|
|-| -|
| file_content | defined variable  |
| | |
|   .  | dot|
| | |
| coords| attributes of file_content |
| | |
| [coord_var] | variable name between brackets  |
| | |
|   .  | dot|
| | |
| values, attrs, dims, coords| attributes of coord_var |

- If the Coordinates array is not empty print its content.

In [9]:
# define coord_var
coord_var = 'obs'

# Print content of coord_var
if file_content.coords:    
    print(str(coord_var),'values:    ', '\n\n', file_content.coords[coord_var].values, '\n\n')
    print(str(coord_var),'attributes:    ', '\n\n', file_content.coords[coord_var].attrs, '\n\n')
    print(str(coord_var),'dimensions:    ', '\n\n', file_content.coords[coord_var].dims, '\n\n')
    print(str(coord_var),'coordinates:    ', '\n\n', file_content.coords[coord_var].coords, '\n\n')

obs values:     

 [    0     1     2 ... 31800 31801 31802] 


obs attributes:     

 {} 


obs dimensions:     

 ('obs',) 


obs coordinates:     

 Coordinates:
  * obs      (obs) int32 0 1 2 3 4 5 6 ... 31797 31798 31799 31800 31801 31802 




______________________________________________
###  _Metadata:_  
- [x] <span style='color:Blue'> The coordinates are stored as an array of integers (0 to 31802). These observation sequence numbers are used as labels for the data points in the observation array. </span> 
______________________________________________

<a id="12"></a>
### <span style='color:Purple'> Data Variables.  </span>

In [10]:
print('- The number of parameters in Data variables:    ', len(file_content.variables), '\n\n')

print('- Data variables is loaded in:    ', type(file_content.variables), '\n\n')

print('- Data variables content:    ', '\n\n', file_content.variables, '\n\n')

- The number of parameters in Data variables:     32 


- Data variables is loaded in:     <class 'xarray.core.utils.Frozen'> 


- Data variables content:     

 Frozen({'obs': <xarray.IndexVariable 'obs' (obs: 31803)>
array([    0,     1,     2, ..., 31800, 31801, 31802], dtype=int32), 'practical_salinity': <xarray.Variable (obs: 31803)>
array([ 0.      , 34.90207 , 34.86657 , ..., 34.839959, 34.82164 , 34.815624])
Attributes:
    comment:                  Salinity is generally defined as the concentrat...
    long_name:                Practical Salinity
    precision:                4
    coordinates:              time lat lon pressure
    data_product_identifier:  PRACSAL_L2
    standard_name:            sea_water_practical_salinity
    units:                    1
    ancillary_variables:      pressure,conductivity,temperature, 'ctd_time': <xarray.Variable (obs: 31803)>
array([4.934925e+08, 4.934934e+08, 4.934943e+08, ..., 5.221125e+08,
       5.221134e+08, 5.221143e+08])
Attributes

<a id="13"></a>
##### Explore Data Variables.

- Set **data_var** to one of the variables stored in Data variables.
- Get the variable name from the output of the previous cell.
- Use the same syntax used for the Coordinates variables.
- If the Data variables array is not empty print its content.
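
The count of 32 printed above is one more than the 31 variables reported in Step 1 because `variables` also includes the coordinate `obs`, while `data_vars` and `len()` of the dataset count only the data variables. A minimal sketch with a hypothetical in-memory dataset shows the difference:

```python
import numpy as np
import xarray as xr

# Hypothetical dataset: two data variables plus one coordinate.
demo = xr.Dataset(
    {
        "temperature": ("obs", np.array([1.0, 2.0])),
        "conductivity": ("obs", np.array([3.0, 4.0])),
    },
    coords={"obs": np.arange(2)},
)

print(len(demo.variables))   # 3: data variables and coordinates together
print(len(demo.data_vars))   # 2: data variables only
print(len(demo))             # 2: same as len(demo.data_vars)
```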

In [11]:
# define data_var
data_var = 'temperature' 

# Print content of data_var
if file_content.variables:
    print(data_var,'values:    ', '\n\n', file_content.variables[data_var].values, '\n\n')
    print(data_var,'attributes:    ', '\n\n', file_content.variables[data_var].attrs, '\n\n')
    print(data_var,'dimensions:    ', '\n\n', file_content.variables[data_var].dims, '\n\n')

temperature values:     

 [390696. 482856. 473096. ... 470865. 423278. 423037.] 


temperature attributes:     

 {'comment': 'Seawater temperature unprocessed measurement near the sensor.', 'long_name': 'Seawater Temperature Measurement', 'precision': 0, 'coordinates': 'time lat lon pressure', 'data_product_identifier': 'TEMPWAT_L0', 'units': 'counts'} 


temperature dimensions:     

 ('obs',) 




<a id="31"></a>
##### Explore Data Variables Attributes.
- More information about the data variables is listed in the variable attributes.
- To list the attribute names, use .keys() as follows:

In [12]:
# get the attribute names of a variable.
file_content.variables[data_var].attrs.keys()

dict_keys(['comment', 'long_name', 'precision', 'coordinates', 'data_product_identifier', 'units'])

<a id="14"></a>
##### Extract the Attributes Content.
- **Note:** The information is stored in a dictionary.
- Select an attribute name from the output of the previous cell.
- Set **data_var_attrs** to one of the attribute names.
- Use the syntax below:

|syntax|description|
|-| -|
| file_content | the variable previously defined   |
| | |
|   .  | dot|
| | |
| variables| attributes of file_content |
| | |
| [data_var] | variable name between brackets |
| | |
|   .  | dot|
| | |
| attrs| attributes of data_var |
| | |
| [data_var_attrs] | variable name between brackets |

- If the attributes array is not empty print its content.
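
Because the attributes are stored in a dictionary, asking for a name that does not exist raises a `KeyError`. One way to avoid that is the dictionary method `.get`, sketched below on a plain dictionary copied from the temperature output shown earlier. The fallback text `'<not provided>'` is an arbitrary choice for this example.

```python
# Attributes copied from the temperature output shown earlier in this notebook.
temperature_attrs = {
    "comment": "Seawater temperature unprocessed measurement near the sensor.",
    "long_name": "Seawater Temperature Measurement",
    "precision": 0,
    "coordinates": "time lat lon pressure",
    "data_product_identifier": "TEMPWAT_L0",
    "units": "counts",
}

for name in ("units", "standard_name"):
    # .get returns the fallback instead of raising KeyError for a missing name.
    print(name, ":    ", temperature_attrs.get(name, "<not provided>"))
```

With the lab file, the same call would be `file_content.variables[data_var].attrs.get(data_var_attrs, '<not provided>')`.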

In [13]:
# define data_var_attrs
data_var_attrs = 'comment'

# print content of data_var_attrs
if file_content.variables[data_var].attrs:    
    print(data_var, data_var_attrs, ':    ', '\n\n',
          file_content.variables[data_var].attrs[data_var_attrs], '\n\n')

    

temperature comment :     

 Seawater temperature unprocessed measurement near the sensor. 




______________________________________________
###  _Metadata:_  
- [x] <span style='color:Blue'> The attribute comment gives more information about the variable temperature, such as the level of data processing (unprocessed) and the location of the measurement (near the sensor). </span>
- [x] <span style='color:Blue'> Note that more metadata is available in the other attributes (e.g. units, precision).   </span> 
______________________________________________

<a id="15"></a>
### <span style='color:Purple'> Attributes.  </span>
- Attributes here refer to the rest of the metadata that is not specific to a data variable. 

In [14]:
print('- The number of variables in Attributes:    ',len(file_content.attrs), '\n\n')

print('- Attributes is loaded in:    ', type(file_content.attrs), '\n\n')

print('- Attributes content:    ', '\n\n', file_content.attrs, '\n\n')

- The number of variables in Attributes:     70 


- Attributes is loaded in:     <class 'dict'> 


- Attributes content:     

 {'node': 'RIM01', 'comment': '', 'publisher_email': '', 'sourceUrl': 'http://oceanobservatories.org/', 'collection_method': 'recovered_inst', 'stream': 'ctdmo_ghqr_instrument_recovered', 'featureType': 'point', 'creator_email': '', 'publisher_name': 'Ocean Observatories Initiative', 'date_modified': '2020-05-27T01:56:29.691647', 'keywords': '', 'cdm_data_type': 'Point', 'references': 'More information can be found at http://oceanobservatories.org/', 'Metadata_Conventions': 'Unidata Dataset Discovery v1.0', 'date_created': '2020-05-27T01:56:29.691635', 'id': 'GI03FLMB-RIM01-02-CTDMOG060-recovered_inst-ctdmo_ghqr_instrument_recovered', 'requestUUID': '5bf2116c-ebc1-44a2-9e3a-e4c23a823746', 'contributor_role': '', 'summary': 'Dataset Generated by Stream Engine from Ocean Observatories Initiative', 'keywords_vocabulary': '', 'institution': 'Ocean Observatories In

<a id="16"></a>
##### Explore Attributes Names.
- To list the attribute names, use .keys() as follows:

In [15]:
# Get the attribute names:
if file_content.attrs:
    print('Attributes Dictionary Keys:    ', '\n\n',
          file_content.attrs.keys(), '\n\n')

Attributes Dictionary Keys:     

 dict_keys(['node', 'comment', 'publisher_email', 'sourceUrl', 'collection_method', 'stream', 'featureType', 'creator_email', 'publisher_name', 'date_modified', 'keywords', 'cdm_data_type', 'references', 'Metadata_Conventions', 'date_created', 'id', 'requestUUID', 'contributor_role', 'summary', 'keywords_vocabulary', 'institution', 'naming_authority', 'feature_Type', 'infoUrl', 'license', 'contributor_name', 'uuid', 'creator_name', 'title', 'sensor', 'standard_name_vocabulary', 'acknowledgement', 'Conventions', 'project', 'source', 'publisher_url', 'creator_url', 'nodc_template_version', 'subsite', 'processing_level', 'history', 'Manufacturer', 'ModelNumber', 'SerialNumber', 'Description', 'FirmwareVersion', 'SoftwareVersion', 'AssetUniqueID', 'Notes', 'Owner', 'RemoteResources', 'ShelfLifeExpirationDate', 'Mobile', 'AssetManagementRecordLastModified', 'time_coverage_start', 'time_coverage_end', 'time_coverage_resolution', 'geospatial_lat_min', 'geospa

<a id="17"></a>
##### Extract Attributes Information.
- Set **attrs_var** to one of the attribute names.
- If the Attribute content is not empty print its content.
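
Several attributes in this file are empty strings (for example 'comment' and 'publisher_email'). The sketch below filters those out so that only attributes carrying information are listed. It works on a plain dictionary holding a few values copied from the output above; the helper name `filled_attributes` is hypothetical.

```python
def filled_attributes(attrs):
    """Keep only the entries whose value is not an empty or blank string."""
    return {
        name: value
        for name, value in attrs.items()
        if not (isinstance(value, str) and value.strip() == "")
    }


# A few global attributes copied from the output above.
global_attrs = {
    "node": "RIM01",
    "comment": "",
    "publisher_email": "",
    "sourceUrl": "http://oceanobservatories.org/",
    "collection_method": "recovered_inst",
}

for name, value in filled_attributes(global_attrs).items():
    print(name, ":    ", value)
```

With the lab file, the call would be `filled_attributes(file_content.attrs)`.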

In [16]:
# define attrs_var
attrs_var = 'RemoteResources' 
if file_content.attrs:
    print(attrs_var,':    ', file_content.attrs[attrs_var], '\n\n')

RemoteResources :     [] 




______________________________________________
###  _Metadata:_
- [x] <span style='color:Blue'> The 'Description' attribute gives more information about the sensor used to collect the data (e.g., the sensor is a "CTD Mooring (Inductive): CTDMO Series G"). </span>
- [x] <span style='color:Blue'> More information on the data may be extracted by setting the attrs_var to other attributes names. </span> 
______________________________________________

<a id="18"></a>
## <span style='color:Green'>  Metadata Summary:  </span>

- In the file examined, metadata can be found in any of the file elements described above: **_Dimensions, Attributes, Data variables, Coordinates_**.

- The Python code used in this notebook is what you need to extract the metadata from a NetCDF file.

The data file metadata can be grouped in the following categories:
- **Category 1:** The data file information.
    - 'date_created', 'date_modified', 'ShelfLifeExpirationDate', 'RemoteResources',
    - 'AssetManagementRecordLastModified', 'FirmwareVersion', 'SoftwareVersion'
    - 'keywords', 'keywords_vocabulary',
    - 'acknowledgement', 'license', 'history', 'Notes', 'comment',
    - 'title', 'source', 'summary', 'id', 'requestUUID', 'uuid',
    
    
- **Category 2:** The project information.
    - 'Owner', 'institution', 'references', 'project', 
    - 'contributor_name', 'contributor_role',
    - 'creator_name','creator_email', 'creator_url',
    - 'publisher_name', 'publisher_email', 'publisher_url', 
    - 'infoUrl','sourceUrl', 'naming_authority',
   
   
- **Category 3:** The data collection information.
    - 'subsite', 'node', 'sensor','stream', 'Mobile', 'collection_method'
    - 'Manufacturer', 'ModelNumber', 'SerialNumber', 'Description', 'AssetUniqueID'
    - 'processing_level', 'cdm_data_type', 'feature_Type' 
    - 'time_coverage_start', 'time_coverage_end', 'time_coverage_resolution',
    - 'lat', 'lon',
    - 'geospatial_lat_min', 'geospatial_lat_max', 'geospatial_lat_units', 'geospatial_lat_resolution', 
    - 'geospatial_lon_min', 'geospatial_lon_max', 'geospatial_lon_units', 'geospatial_lon_resolution', 
    - 'geospatial_vertical_units', 'geospatial_vertical_resolution', 'geospatial_vertical_positive'

    
- **Category 4:** The conventions used to create the metadata or data.
    - 'Metadata_Conventions', 'nodc_template_version', 'Conventions', 'standard_name_vocabulary'
  
  
- **Category 5:** The parameter measured information.
    - 'comment'
    - 'long_name' 
    - 'standard_name'
    - 'precision'
    - 'data_product_identifier'
    - 'units'
    - 'values'
    - 'dimension'
    - 'coordinates'
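
The grouping above can also be applied in code. The following is a minimal sketch, assuming a shortened version of the category lists and a few attribute values copied from the output earlier in this notebook. The names `CATEGORIES` and `group_by_category` are hypothetical.

```python
# Category labels and member names follow the summary above (shortened here).
CATEGORIES = {
    "data file": ["date_created", "date_modified", "title", "summary", "id"],
    "project": ["Owner", "institution", "project", "publisher_name"],
    "data collection": ["subsite", "node", "sensor", "stream", "collection_method"],
    "conventions": ["Metadata_Conventions", "Conventions"],
}


def group_by_category(attrs, categories):
    """Sort attribute names and values into the category each name belongs to."""
    lookup = {name: label for label, names in categories.items() for name in names}
    grouped = {}
    for name, value in attrs.items():
        # Names missing from every category list are collected separately.
        grouped.setdefault(lookup.get(name, "uncategorized"), {})[name] = value
    return grouped


# Values copied from the global attributes printed earlier in this notebook.
sample_attrs = {
    "node": "RIM01",
    "publisher_name": "Ocean Observatories Initiative",
    "date_created": "2020-05-27T01:56:29.691635",
    "Metadata_Conventions": "Unidata Dataset Discovery v1.0",
}

for label, entries in group_by_category(sample_attrs, CATEGORIES).items():
    print(label, "->", sorted(entries))
```

With the lab file, `group_by_category(file_content.attrs, CATEGORIES)` would group the global attributes once the category lists are filled in completely.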
    

<span style='color:Green'> What is Next:  </span>
- The information in the file is what you need to start evaluating the quality of your data. 
- Data quality evaluation will be covered in more detail in the upcoming course labs.

## END