# Explore the Data File Metadata.

The data file used in the Module 2 Lab assignment is a NetCDF file. The following is a brief description of the file type, its data structure, and why it is suitable for the metadata exercise.

#### File Type: 

- The file type used for this exercise is the Network Common Data Form (NetCDF) file, with the file extension ".nc".
- NetCDF is a binary file format.
- It is a self-documenting data storage and data access method that is widely used in the geosciences.



 #### Data Structure:
 
- NetCDF is commonly used for storing and interchanging multidimensional scientific data variables (Schema 1).
- This file type includes data (observation values) and descriptive information about the data (Metadata). 
- Data are organized along dimensions such as time, latitude, longitude, or depth. 


##### Why NetCDF File?

- The NetCDF files are easy to manipulate.
- Metadata is entered in a structured way, so it is easy to retrieve.
- Metadata entries can be written as free text and can be fairly long.
- Other common file types, such as Excel (.xlsx), plain text (.txt), and comma-separated values (.csv), do not share all of these characteristics.

### Schema 1.
#### NetCDF Multidimensional Data Structure Example.
[Schema 1 Link](https://drive.google.com/open?id=1U6FIN_HOADG2O00Pcd6OgmIIDms7KzSe)
 

### <font color="Green"> Outline   </font>

- In this exercise we are going to execute the following steps to learn how to work with the NetCDF files and extract metadata information:

    * [Step 0. Import Python Packages.](#0)
    * [Step 1. Mounting Google Drive.](#1)
    * [Step 2. Access, Load, and Print the Data File Content.](#11)

        * [Access Data File.](#12)

        * [Define Data File.](#2)
        
        * [Load Data File.](#3)
        
        * [Print Data File.](#4)
        
        * [Data File Content.](#5)
        
    * [Step 3. Extract File Elements Separately.](#6)
    
        * [Python Syntax.](#7)
        
        * [Dimensions](#8)
        
            * [Explore Dimensions Variables.](#9)
            
        * [Coordinates.](#10)
         
            * [Explore Coordinates Variables.](#11)
             
        * [Data Variables.](#12)
         
            * [Explore Data Variables.](#13)
            
            * [Explore Data Variables Attributes.](#31)
             
            * [Extract the Attributes Content.](#14)
             
        * [Attributes.](#15)
         
            * [Explore Attributes Variables.](#16)
             
            * [Extract Attributes Information.](#17)
             
    * [Metadata Summary.](#18)
        

<font color="Orange" > **Attention:** </font> 
- This lab learning material is a reference for the rest of the course labs. 
- To run the notebook, you need to follow the steps in order.
- Do not forget to run your cell before you move on to the next one. 
- This is especially true for the code cells: the output of a cell may be an input in the next cell.

<a id="0"></a>
## Step 0. Import Python Packages.
To run the code in this notebook, you need the xarray and pandas Python packages.
Use the import statement to load the packages into the notebook.

<font color="Purple" > **Note:** </font> 
- If you need to add more packages you can do so in a code cell. 

- A package must be imported before any of its functions are called.


In [0]:
# xr and pd are the aliases used to call functions from the xarray and pandas packages.
import xarray as xr
import pandas as pd
import os

<a id="1"></a>
## Step 1. Mounting Google Drive.
-  To access files on Google Colab, run the following two command lines and follow the instructions:

- Instructions:
  - when asked to enter your code, click on "Go to this URL in a browser",
  - then click on your account and copy the code,
  - enter the copied code in "Enter your authorization code" and press Enter.

  - you should get "Mounted at /content/drive"

In [4]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


<a id="11"></a>
## Step 2. Access, Load, and Print the Data File Content.


<a id="12"></a>
### Access Data File.
- Change directory to where your data file is stored using the following command:



In [0]:
os.chdir('/content/drive/Shared drives/GEOS689-776/Module2_NetCDF_Files')

  - Check the directory you are in and list its content using the following commands:

In [0]:
# use pwd to check which directory you are in
!pwd
# use ls to list files and directories on your drive
!ls
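
If the directory holds many files, it can help to list only the NetCDF ones. The sketch below does this with the standard `os` module; `list_netcdf_files` is a hypothetical helper name introduced here for illustration, not part of the lab material.

```python
import os


def list_netcdf_files(directory="."):
    """Return the sorted names of regular files in `directory` ending in .nc."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.lower().endswith(".nc")
        and os.path.isfile(os.path.join(directory, name))
    )


# List the NetCDF files in the current working directory.
print(list_netcdf_files())
```

Any name printed by this cell can be copied directly into the **filename** variable in the next step.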

<a id="2"></a>
### Define Data File.
- Since you are in the directory where the data file is stored you only need to set the variable **filename** to the name of the data file:

 - filename = 'your data file name'  
 
 **Note**: Your data file name should end with .nc

In [0]:
filename = 'deployment0004_GP03FLMB-RIM01-02-CTDMOG060-telemetered-ctdmo_ghqr_sio_mule_instrument_20161008T080001-20161030T120001.nc'
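
The file name itself appears to carry some metadata. The following is a minimal sketch that splits it into parts, assuming the layout `deployment_identifier_start-end.nc`; that layout is inferred from this single file name and is an assumption, not a documented rule. The names `deployment`, `dataset_id`, `start`, and `end` are illustrative.

```python
import os
from datetime import datetime

filename = 'deployment0004_GP03FLMB-RIM01-02-CTDMOG060-telemetered-ctdmo_ghqr_sio_mule_instrument_20161008T080001-20161030T120001.nc'

# Drop the .nc extension.
stem = os.path.splitext(filename)[0]

# Assumed layout: first "_" ends the deployment part, last "_" starts the time range.
deployment, rest = stem.split("_", 1)
dataset_id, time_range = rest.rsplit("_", 1)
start_text, end_text = time_range.split("-")

# Timestamps in the name look like YYYYMMDDTHHMMSS.
start = datetime.strptime(start_text, "%Y%m%dT%H%M%S")
end = datetime.strptime(end_text, "%Y%m%dT%H%M%S")

print("deployment:", deployment)
print("dataset id:", dataset_id)
print("start     :", start.isoformat())
print("end       :", end.isoformat())
```

Values obtained this way can later be compared with the attributes stored inside the file, which remain the authoritative source.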

<a id="3"></a>
### Load Data File. 
- Use xarray (**xr**) to open and read the data file content.
- Set **file_content** as a variable to record the data file content.

In [0]:
# Since this is a NetCDF4 file, you need to install the netcdf4 library.
! pip install netcdf4
file_content = xr.open_dataset(filename,mask_and_scale=False)
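
Passing `mask_and_scale=False` asks xarray to return the values as stored, without replacing fill values by NaN or applying a scale factor, so the `_FillValue` entry stays visible among the variable attributes. The sketch below shows how a fill value could be masked by hand on a small made-up array; the names `raw` and `masked` and the placement of the fill value are illustrative and are not taken from the data file.

```python
import numpy as np
import xarray as xr

fill = -9999999  # fill value used for illustration

# Made-up raw counts with one missing observation marked by the fill value.
raw = xr.DataArray(
    np.array([190588, fill, 196272], dtype="int32"),
    dims=("obs",),
    attrs={"_FillValue": fill, "units": "counts"},
)

# Keep values that differ from the fill value; the others become NaN.
masked = raw.where(raw != raw.attrs["_FillValue"])

print(masked.values)
print("missing observations:", int(masked.isnull().sum()))
```

Note that masking converts the integer counts to floating point, because NaN cannot be stored in an integer array.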

<a id="4"></a>
### Print Data File.
- Use Python functions **print, len, type** to get more information about the content of the data file.

In [0]:
# Print the length of the file_content using the print and len functions

print('- The number of variables in the file:    ', len(file_content), '\n\n')

# Print the type of the file_content using the print and type functions

print('- The file variables are loaded in:    ', type(file_content), '\n\n')

# Print the content of the file

print('- File Content:    ', '\n\n', file_content, '\n\n')


- The number of variables in the file:     30 


- The file variables are loaded in:     xarray.core.dataset.Dataset 


- File Content:     

 <xarray.Dataset>
Dimensions:                                  (obs: 20)
Coordinates:
  * obs                                      (obs) int32 0 1 2 3 ... 16 17 18 19
Data variables:
    time                                     (obs) datetime64[ns] ...
    deployment                               (obs) int32 ...
    id                                       (obs) |S36 ...
    conductivity                             (obs) int32 ...
    ctd_time                                 (obs) datetime64[ns] ...
    driver_timestamp                         (obs) datetime64[ns] ...
    inductive_id                             (obs) uint8 ...
    ingestion_timestamp                      (obs) datetime64[ns] ...
    internal_timestamp                       (obs) datetime64[ns] ...
    port_timestamp                           (obs) datetime64[ns] ...
    preferre

<a id="5"></a>
### Data File Content.

**Data File Elements:**
The NetCDF files are organized as follows:
- <font color="Purple"> Dimensions: </font>
   - Variables:  e.g., obs 
        - Parameters. e.g., values
- <font color="Purple"> Coordinates: </font>
   - Variables:  e.g., obs
        - Parameters.  e.g., values
- <font color="Purple"> Data variables: </font>
   - Variables:  e.g., time
        - Parameters. e.g., values, unit, long_name
- <font color="Purple"> Attributes: </font>
   - Variables: e.g., node
        - Parameters. e.g., information
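
To make the four elements concrete, the following sketch builds a tiny dataset in memory and prints each element in turn. The dataset `toy` is hypothetical: it has three observations and a single variable, with values and attribute names chosen only for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical miniature dataset; contents are illustrative only.
toy = xr.Dataset(
    data_vars={
        "temperature": (
            "obs",
            np.array([190588, 238965, 196272], dtype="int32"),
            {"units": "counts", "long_name": "Seawater Temperature Measurement"},
        ),
    },
    coords={"obs": np.arange(3, dtype="int32")},
    attrs={"node": "RIM01"},
)

print("Dimensions     :", dict(toy.sizes))
print("Coordinates    :", list(toy.coords))
print("Data variables :", list(toy.data_vars))
print("Attributes     :", toy.attrs)
print("Variable attrs :", toy["temperature"].attrs)
```

The same five expressions can be applied to **file_content** once the real file is loaded.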
        
 


> ###  _Metadata:_  
- [x] <font color="blue">   Metadata exists in any of the elements described above. The rest of the notebook will show how to access the metadata information. </font>


<a id="6"></a>
## Step 3. Extract File Elements Separately.


<a id="7"></a>
### Python Syntax:    
**file_content.attributes** 

|  || ||
| -  | -  | -  | -  |
|<font color="blue"> **syntax:** </font> | file_content   | .  |  _attributes_ |
|**Description**|defined variable|dot |replace by file_content's attributes: <font color="Purple"> coords, variables, dims </font> |



<a id="8"></a>
###  <font color="Purple"> Dimensions.  </font> ( .dims)

In [0]:
print('- The number of variables in Dimensions:    ', len(file_content.dims), '\n\n')

print('- Dimensions is loaded in:    ', type(file_content.dims), '\n\n')

print('- Dimensions content:    ', file_content.dims, '\n\n')


- The number of variables in Dimensions:     1 


- Dimensions is loaded in:     xarray.core.utils.Frozen 


- Dimensions content:     Frozen(SortedKeysDict({'obs': 20})) 




<a id="9"></a>
##### Explore Dimensions variables. ( .dims[ ])
- Set **dims_var** to one of the variables stored in Dimensions.
- Get the variable name from the output of the previous cell.
- Use the following syntax: **file_content.dims[dims_var]**

|syntax|description|
|-| -|
| file_content | the variable previously defined   |
| | |
|   .  | dot|
| | |
| dims| attributes of file_content |
| | |
| [dims_var] | variable name between brackets |

- If the Dimensions content is not empty print its content.

In [0]:
# define dims_var
dims_var = 'obs'

# Print the content of dims_var

if file_content.dims:
    print(dims_var, ':    ', file_content.dims[dims_var], '\n\n')

obs :     20 





> ###  _Metadata:_ 
- [x] <font color="Blue"> We expect to have 20 data points for each parameter measured. </font> 
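
The expectation that every variable holds one value per observation can be checked in code. The sketch below uses a hypothetical stand-in dataset (`toy`) and the `.sizes` mapping, which newer xarray releases steer users toward for looking up dimension lengths; whether `.dims` behaves identically depends on the installed xarray version, so treat this as an assumption to verify.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for file_content with 20 observations.
toy = xr.Dataset(
    {
        "conductivity": ("obs", np.zeros(20, dtype="int32")),
        "temperature": ("obs", np.zeros(20, dtype="int32")),
    }
)

# Length of the obs dimension.
n_obs = toy.sizes["obs"]
print("obs:", n_obs)

# Names of data variables whose shape does not match the obs dimension.
mismatched = [
    name for name, variable in toy.data_vars.items()
    if variable.shape != (n_obs,)
]
print("variables with unexpected shape:", mismatched)
```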


<a id="10"></a>
### <font color="Purple"> Coordinates. </font> (   .coords)

In [0]:
print('- The number of parameters in Coordinates:    ', len(file_content.coords), '\n\n')

print('- Coordinates is loaded in:    ', type(file_content.coords), '\n\n')

print('- Coordinates content:    ', '\n\n', file_content.coords, '\n\n')

- The number of parameters in Coordinates:     1 


- Coordinates is loaded in:     xarray.core.coordinates.DatasetCoordinates 


- Coordinates content:     

 Coordinates:
  * obs      (obs) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 




<a id="11"></a>
##### Explore Coordinates variables. ( .coords[ ].values or (.attrs .dims .coords)   )

- Set **coord_var** to one of the variables stored in Coordinates.
- Get the variable name from the output of the previous cell.
- Use the variable attributes to extract the parameters' content: **values, attrs, dims, coords**
- The syntax to use is:

|syntax|description|
|-| -|
| file_content | defined variable  |
| | |
|   .  | dot|
| | |
| coords| attribute of file_content |
| | |
| [coord_var] | variable name between brackets  |
| | |
|   .  | dot|
| | |
| values, attrs, dims, coords| attributes of coord_var |

- If the Coordinates array is not empty print its content.

In [0]:
# define coord_var
coord_var = 'obs'

# Print content of coord_var
if file_content.coords:    
    print(str(coord_var),'values:    ', '\n\n', file_content.coords[coord_var].values, '\n\n')
    print(str(coord_var),'attributes:    ', '\n\n', file_content.coords[coord_var].attrs, '\n\n')
    print(str(coord_var),'dimensions:    ', '\n\n', file_content.coords[coord_var].dims, '\n\n')
    print(str(coord_var),'coordinates:    ', '\n\n', file_content.coords[coord_var].coords, '\n\n')

obs values:     

 [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] 


obs attributes:     

 {} 


obs dimensions:     

 ('obs',) 


obs coordinates:     

 Coordinates:
  * obs      (obs) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 





> ###  _Metadata:_  
- [x] <font color="Blue"> The coordinates are set as an array of observation sequence numbers (0 to 19). These sequence numbers are used as labels for the data points in the observation array. </font> 

<a id="12"></a>
### <font color="Purple"> Data Variables.  </font> (  .variables)

In [0]:
print('- The number of parameters in Data variables:    ', len(file_content.variables), '\n\n')

print('- Data variables is loaded in:    ', type(file_content.variables), '\n\n')

print('- Data variables content:    ', '\n\n', file_content.variables, '\n\n')

- The number of parameters in Data variables:     31 


- Data variables is loaded in:     xarray.core.utils.Frozen 


- Data variables content:     

 Frozen({'obs': <xarray.IndexVariable 'obs' (obs: 20)>
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19], dtype=int32), 'time': <xarray.Variable (obs: 20)>
array(['2016-10-08T08:00:01.000000000', '2016-10-08T12:00:01.000000000',
       '2016-10-19T00:00:01.000000000', '2016-10-19T04:00:01.000000000',
       '2016-10-19T08:00:01.000000000', '2016-10-19T12:00:01.000000000',
       '2016-10-20T00:00:01.000000000', '2016-10-20T04:00:01.000000000',
       '2016-10-20T08:00:01.000000000', '2016-10-20T12:00:01.000000000',
       '2016-10-20T16:00:01.000000000', '2016-10-25T12:00:01.000000000',
       '2016-10-25T16:00:01.000000000', '2016-10-25T20:00:01.000000000',
       '2016-10-26T00:00:01.000000000', '2016-10-26T04:00:01.000000000',
       '2016-10-26T08:00:01.000000000', '2016-10-26T12:00:01.0000
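
The count of 31 reported here is one more than the 30 variables counted for the whole file, because `.variables` also includes the coordinate variable `obs`, while `.data_vars` lists the data variables only. A minimal sketch on a hypothetical two-observation dataset (`toy`) shows the difference:

```python
import numpy as np
import xarray as xr

# Hypothetical dataset: one data variable plus the obs coordinate.
toy = xr.Dataset(
    {"temperature": ("obs", np.array([190588, 238965], dtype="int32"))},
    coords={"obs": np.arange(2, dtype="int32")},
)

all_names = set(toy.variables)    # data variables and coordinates
data_names = set(toy.data_vars)   # data variables only
coord_names = set(toy.coords)     # coordinates only

print("variables:", len(all_names))
print("data_vars:", len(data_names))
print("coords   :", len(coord_names))
print("only in .variables:", all_names - data_names)
```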

<a id="13"></a>
##### Explore Data Variables. ( .variables[ ].values or ( .dims . attrs))

- Set **data_var** to one of the variables stored in Data variables.
- Get the variable name from the output of the previous cell.
- Use the same syntax used for the Coordinates variables.
- If the Data variables array is not empty print its content.

In [0]:
# define data_var
data_var = 'temperature' 

# Print content of data_var
if file_content.variables:
    print(data_var,'values:    ', '\n\n', file_content.variables[data_var].values, '\n\n')
    print(data_var,'attributes:    ', '\n\n', file_content.variables[data_var].attrs, '\n\n')
    print(data_var,'dimensions:    ', '\n\n', file_content.variables[data_var].dims, '\n\n')

temperature values:     

 [190588 238965 196272 226870 226541 186872 225630 227213 226823 188281
 226229 220177 219794 219456 219543 219077 218676 195684 202063 215955] 


temperature attributes:     

 {'_FillValue': -9999999, 'comment': 'Seawater temperature unprocessed measurement near the sensor.', 'long_name': 'Seawater Temperature Measurement', 'precision': 0, 'coordinates': 'time lat lon pressure', 'data_product_identifier': 'TEMPWAT_L0', 'units': 'counts'} 


temperature dimensions:     

 ('obs',) 




<a id="31"></a>
##### Explore Data Variables Attributes. (  .variables[ ].attrs.keys( ))
- More information about the data variables is listed in the variable attributes.
- To list the attributes, use .keys() as follows:

In [0]:
# get the attribute names of a variable.
file_content.variables[data_var].attrs.keys()

dict_keys(['_FillValue', 'comment', 'long_name', 'precision', 'coordinates', 'data_product_identifier', 'units'])

<a id="14"></a>
##### Extract the Attributes Content. (.variables[ ].attrs[ ])
- **Note:** The information is stored in a dictionary.
- Select an attribute name from the output of the previous cell.
- Set **data_var_attrs** to one of the attribute names.
- Use the syntax below:

|syntax|description|
|-| -|
| file_content | the variable previously defined   |
| | |
|   .  | dot|
| | |
| variables| attributes of file_content |
| | |
| [data_var] | variable name between brackets |
| | |
|   .  | dot|
| | |
| attrs| attributes of data_var |
| | |
| [data_var_attrs] | attribute key name between brackets |

- If the attributes dictionary is not empty, print its content.

In [0]:
# define data_var_attrs
data_var_attrs = 'comment'

# print content of data_var_attrs
if file_content.variables[data_var].attrs:    
    print(data_var, data_var_attrs,':    ', '\n\n', \
          file_content.variables[data_var].attrs[data_var_attrs],'\n\n')

    

temperature comment :     

 Seawater temperature unprocessed measurement near the sensor. 





> ###  _Metadata:_  
- [x] <font color="Blue"> The attribute comment gives more information about the variable temperature, such as the level of data processing (unprocessed) and the location of the measurement (near the sensor). </font> 
- [x] <font color="Blue"> Note that more metadata is available in the other attributes (e.g. units, precision).   </font> 
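
Because the attributes are stored in a dictionary, all of them can be printed in one loop, and `.get()` can be used to ask for an attribute that may not exist without raising an error. The sketch below works on a plain dictionary copied from the temperature attributes printed above; in the notebook, `file_content.variables[data_var].attrs` could be used in its place.

```python
# Attributes of the temperature variable, copied from the output above.
temperature_attrs = {
    '_FillValue': -9999999,
    'comment': 'Seawater temperature unprocessed measurement near the sensor.',
    'long_name': 'Seawater Temperature Measurement',
    'precision': 0,
    'coordinates': 'time lat lon pressure',
    'data_product_identifier': 'TEMPWAT_L0',
    'units': 'counts',
}

# Print every attribute name with its content.
for key, value in temperature_attrs.items():
    print(f"{key:>24}: {value}")

# .get() returns a default instead of failing when the key is missing.
standard_name = temperature_attrs.get('standard_name', 'not provided')
print('standard_name:', standard_name)
```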


<a id="15"></a>
### <font color="Purple"> Attributes.  </font> ( .attrs)
- Attributes here refer to the rest of the metadata that is not specific to a data variable. 

In [0]:
print('- The number of variables in Attributes:    ',len(file_content.attrs), '\n\n')

print('- Attributes is loaded in:    ', type(file_content.attrs), '\n\n')

print('- Attributes content:    ', '\n\n', file_content.attrs, '\n\n')

- The number of variables in Attributes:     70 


- Attributes is loaded in:     <class 'dict'> 


- Attributes content:     

 {'node': 'RIM01', 'comment': '', 'publisher_email': '', 'sourceUrl': 'http://oceanobservatories.org/', 'collection_method': 'telemetered', 'stream': 'ctdmo_ghqr_sio_mule_instrument', 'featureType': 'point', 'creator_email': '', 'publisher_name': 'Ocean Observatories Initiative', 'date_modified': '2019-10-17T03:16:56.980560', 'keywords': '', 'cdm_data_type': 'Point', 'references': 'More information can be found at http://oceanobservatories.org/', 'Metadata_Conventions': 'Unidata Dataset Discovery v1.0', 'date_created': '2019-10-17T03:16:56.980553', 'id': 'GP03FLMB-RIM01-02-CTDMOG060-telemetered-ctdmo_ghqr_sio_mule_instrument', 'requestUUID': 'be8cec74-60c4-4df9-b09b-07063ebf642d', 'contributor_role': '', 'summary': 'Dataset Generated by Stream Engine from Ocean Observatories Initiative', 'keywords_vocabulary': '', 'institution': 'Ocean Observatories Initiative

<a id="16"></a>
##### Explore Attributes Names. ( .attrs.keys( ))
- To list the attribute names, use .keys() as follows:

In [0]:
# get the attribute names:
if file_content.attrs:
    print('Attributes Dictionary Keys:    ', '\n\n', \
          file_content.attrs.keys(), '\n\n')

Attributes Dictionary Keys:     

 dict_keys(['node', 'comment', 'publisher_email', 'sourceUrl', 'collection_method', 'stream', 'featureType', 'creator_email', 'publisher_name', 'date_modified', 'keywords', 'cdm_data_type', 'references', 'Metadata_Conventions', 'date_created', 'id', 'requestUUID', 'contributor_role', 'summary', 'keywords_vocabulary', 'institution', 'naming_authority', 'feature_Type', 'infoUrl', 'license', 'contributor_name', 'uuid', 'creator_name', 'title', 'sensor', 'standard_name_vocabulary', 'acknowledgement', 'Conventions', 'project', 'source', 'publisher_url', 'creator_url', 'nodc_template_version', 'subsite', 'processing_level', 'history', 'Manufacturer', 'ModelNumber', 'SerialNumber', 'Description', 'FirmwareVersion', 'SoftwareVersion', 'AssetUniqueID', 'Notes', 'Owner', 'RemoteResources', 'ShelfLifeExpirationDate', 'Mobile', 'AssetManagementRecordLastModified', 'time_coverage_start', 'time_coverage_end', 'time_coverage_resolution', 'geospatial_lat_min', 'geospa

<a id="18"></a>
##### Extract Attributes Information. ( .attrs[ ])
- Set **attrs_var** to one of the Attribute names.
- If the Attribute content is not empty print its content.

In [0]:
# define attrs_var
attrs_var = 'sourceUrl' 
if file_content.attrs:
    print(attrs_var,':    ', file_content.attrs[attrs_var], '\n\n')

sourceUrl :     http://oceanobservatories.org/ 




In [0]:
# define attrs_var
attrs_var = 'references' 
if file_content.attrs:
    print(attrs_var,':    ', file_content.attrs[attrs_var], '\n\n')

references :     More information can be found at http://oceanobservatories.org/ 




In [0]:
# define attrs_var
attrs_var = 'date_created' 
if file_content.attrs:
    print(attrs_var,':    ', file_content.attrs[attrs_var], '\n\n')

date_created :     2019-10-17T03:16:56.980553 




> ###  _Metadata:_
- [x] <font color="Blue"> The 'Description' attribute gives more information about the sensor used to collect the data (e.g., the sensor is a "CTD Mooring (Inductive): CTDMO Series G"). </font>
- [x] <font color="Blue"> More information on the data may be extracted by setting the attrs_var to other attributes names. </font> 
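
Rather than repeating one cell per attribute, several attributes can be read in a single loop, and attributes with no content can be flagged at the same time. This is a minimal sketch on a small dictionary copied from part of the output above; in the notebook, `file_content.attrs` could be used instead of the hypothetical `global_attrs`.

```python
# A few global attributes, copied from the file attributes printed above.
global_attrs = {
    'node': 'RIM01',
    'comment': '',
    'publisher_email': '',
    'sourceUrl': 'http://oceanobservatories.org/',
    'collection_method': 'telemetered',
    'date_created': '2019-10-17T03:16:56.980553',
}

# Read several attributes in one pass; missing names get a placeholder.
for attrs_var in ['sourceUrl', 'date_created', 'node']:
    print(attrs_var, ':    ', global_attrs.get(attrs_var, '(attribute not present)'))

# Attributes that exist but hold an empty string carry no information.
empty = [name for name, value in global_attrs.items() if value == '']
print('Attributes with no content:', empty)
```

Listing the empty attributes is a quick way to see which parts of the metadata record were never filled in.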

<a id="19"></a>
## <font color="Green">  Metadata Summary:  </font>

- In the file examined, metadata can be found in any of the file elements described above: **_Dimensions, Attributes, Data variables, Coordinates_**.

- The Python code used in this notebook is what you need to extract the metadata information from a NetCDF file.

- You can use the last code cell to retrieve any of the information in the file attributes and see if you can describe the data in the file and its provenance.  


The data file metadata can be grouped in the following categories:
- **Category 1:** The data file information.
    - 'date_created', 'date_modified', 'ShelfLifeExpirationDate', 'RemoteResources',
    - 'AssetManagementRecordLastModified', 'FirmwareVersion', 'SoftwareVersion'
    - 'keywords', 'keywords_vocabulary',
    - 'acknowledgement', 'license', 'history', 'Notes', 'comment',
    - 'title', 'source', 'summary', 'id', 'requestUUID', 'uuid',
    
    
- **Category 2:** The project information.
    - 'Owner', 'institution', 'references', 'project', 
    - 'contributor_name', 'contributor_role',
    - 'creator_name','creator_email', 'creator_url',
    - 'publisher_name', 'publisher_email', 'publisher_url', 
    - 'infoUrl','sourceUrl', 'naming_authority',
   
   
- **Category 3:** The data collection information.
    - 'subsite', 'node', 'sensor','stream', 'Mobile', 'collection_method'
    - 'Manufacturer', 'ModelNumber', 'SerialNumber', 'Description', 'AssetUniqueID'
    - 'processing_level', 'cdm_data_type', 'feature_Type' 
    - 'time_coverage_start', 'time_coverage_end', 'time_coverage_resolution',
    - 'lat', 'lon',
    - 'geospatial_lat_min', 'geospatial_lat_max', 'geospatial_lat_units', 'geospatial_lat_resolution', 
    - 'geospatial_lon_min', 'geospatial_lon_max', 'geospatial_lon_units', 'geospatial_lon_resolution', 
    - 'geospatial_vertical_units', 'geospatial_vertical_resolution', 'geospatial_vertical_positive'

    
- **Category 4:** The conventions used to create the metadata or data.
    - 'Metadata_Conventions', 'nodc_template_version', 'Conventions', 'standard_name_vocabulary'
  
  
- **Category 5:** The parameter measured information.
    - 'comment'
    - 'long_name' 
    - 'standard_name'
    - 'precision'
    - 'data_product_identifier'
    - 'units'
    - 'values'
    - 'dimension'
    - 'coordinates'
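
The grouping above can also be expressed as a lookup table, so that any attribute name can be assigned to its category in code. The sketch below uses a shortened version of the lists; `CATEGORIES`, `categorize`, and the attribute name `my_custom_note` are hypothetical and introduced only for illustration.

```python
# Shortened category lists based on the grouping above (not exhaustive).
CATEGORIES = {
    "data file": {"date_created", "date_modified", "title", "summary", "id", "requestUUID"},
    "project": {"institution", "project", "publisher_name", "sourceUrl"},
    "data collection": {"subsite", "node", "sensor", "stream", "collection_method"},
    "conventions": {"Metadata_Conventions", "Conventions"},
}


def categorize(name):
    """Return the category of an attribute name, or 'uncategorized'."""
    for category, names in CATEGORIES.items():
        if name in names:
            return category
    return "uncategorized"


# 'my_custom_note' is a made-up name to show the fallback case.
sample = ["node", "sourceUrl", "date_created", "Conventions", "my_custom_note"]

grouped = {}
for name in sample:
    grouped.setdefault(categorize(name), []).append(name)

for category, names in grouped.items():
    print(f"{category}: {names}")
```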
    

### <font color="Green"> What is Next:  </font>
- The information in the file is what you need to start evaluating the quality of your data.  
- Data quality evaluation will be covered in more detail in the upcoming course labs.

## END