# Import modules

Let's import the modules that we will use.

In [1]:
import xarray as xr  # For reading the NetCDF dataset
import numpy as np
import pandas as pd

# Introducing the data

In this example, we will be loading a depth profile of some Chlorophyll A data. However, this example should be relevant for depth profiles of any data.

Nansen Legacy data can be found via the SIOS data access portal. All Nansen Legacy datasets should be returned when filtering using the 'AeN' collection. Please contact data.nleg@unis.no if you have any problems finding or accessing data.

I have downloaded the following dataset into my directory.

# Loading the data

In [2]:
data = xr.open_dataset('AR_PR_CT_58US_2021710.nc')

# Overview of the file

Firstly, let's have a look at the entire dataset.

In [3]:
data

At a glance, we can see that it has 5 dimensions. They show that there are data from 44 different locations and 4363 depth points. This doesn't mean that there are 4363 samples at every station; more likely there is a lot of 'empty' space in the file where no measurement was taken at a given depth. This padding is necessary so that a single depth dimension can serve a range of depth profiles that each sample different depths.
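One way to check how much of a profile is actually filled is to count the non-empty values along the depth dimension. The sketch below uses a small made-up array in place of the real file; the name `temp` and its values are assumptions for illustration only.

```python
import numpy as np
import xarray as xr

# Toy stand-in for the real file: 2 profiles on a shared 5-level depth axis.
# NaN marks depths where no measurement was taken (values are made up).
temp = xr.DataArray(
    np.array([[5.6, 5.8, 5.7, np.nan, np.nan],
              [0.2, 0.5, 0.5, 0.7, 0.9]]),
    dims=("TIME", "DEPTH"),
    name="TEMP",
)

# Count the non-empty samples in each profile
samples_per_profile = temp.notnull().sum(dim="DEPTH")
print(samples_per_profile.values)
```

The same call should work on the loaded dataset, for example `data['TEMP'].notnull().sum(dim='DEPTH')`, assuming missing values are read in as NaN.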

There is no coordinate variable for depth, so we don't know for sure what depths were sampled. However, the geospatial_vertical_min and geospatial_vertical_max attributes are 5 and 4367 respectively, so we can assume that the depths run between these two values at 1 m intervals. That gives 4367 - 5 + 1 = 4363 values, which matches the length of the depth dimension.

We are going to create a depth coordinate variable that we will need later.

In [4]:
depths = np.arange(5,4368,1)
depths

array([   5,    6,    7, ..., 4365, 4366, 4367])

In [5]:
data.coords['DEPTH'] = depths
data.coords

Coordinates:
  * TIME       (TIME) datetime64[ns] 2021-08-26T16:28:23 ... 2021-09-22T04:32:56
  * LATITUDE   (LATITUDE) float32 76.0 81.46 81.8 81.8 ... 83.85 83.84 83.15
  * LONGITUDE  (LONGITUDE) float32 31.22 31.07 30.88 ... -9.537 -9.631 -9.604
  * DEPTH      (DEPTH) int64 5 6 7 8 9 10 11 ... 4362 4363 4364 4365 4366 4367
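The depth values can also be derived from the global attributes instead of being typed by hand, with a check that the assumed 1 m spacing matches the dimension length. This is only a sketch on a made-up dataset: `toy`, its values, and the numeric attribute types are assumptions, and the 1 m spacing remains an assumption about the real file too.

```python
import numpy as np
import xarray as xr

# Toy dataset with a DEPTH dimension but no DEPTH coordinate (values are made up)
toy = xr.Dataset(
    {"TEMP": (("TIME", "DEPTH"), np.zeros((2, 4)))},
    attrs={"geospatial_vertical_min": 5, "geospatial_vertical_max": 8},
)

# Attributes are sometimes stored as strings, so convert before use
zmin = int(float(toy.attrs["geospatial_vertical_min"]))
zmax = int(float(toy.attrs["geospatial_vertical_max"]))
depths = np.arange(zmin, zmax + 1, 1)  # assumes 1 m spacing

# Only assign the coordinate if the assumed spacing matches the dimension length
if depths.size != toy.sizes["DEPTH"]:
    raise ValueError("Assumed 1 m spacing does not match the DEPTH dimension")
toy = toy.assign_coords(DEPTH=depths)
print(toy.coords)
```

If the lengths disagree, the error is a signal that the spacing assumption is wrong and the depths need to come from another source, such as the pressure variable.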

There is then a whole host of variables and attributes. The coordinate variables are listed first, each with the same name as its dimension. For example, TIME(TIME) reads as VARIABLE(DIMENSION): the dimension states how many times were sampled, and the variable states what those times are.

Most of the variables have two dimensions: depth and time. Latitude and longitude are only used in coordinate variables, but we can assume here that each coordinate corresponds to a single time. There are other ways to create a NetCDF file that state this more explicitly, by having longitude and latitude variables that each have the dimension of time, thus linking them together. An important point to take away is that different people have different ways of doing things, but we should be able to easily understand what has been done and adapt our code accordingly.
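As a rough illustration of that alternative layout, the sketch below builds a tiny dataset in which LATITUDE and LONGITUDE share the TIME dimension. The dataset `alt` is hypothetical and its values are illustrative; it is not how the file used here is structured.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.to_datetime(["2021-08-26T16:28:23", "2021-08-28T04:10:36"])

# Hypothetical layout: positions are attached to TIME, so each profile
# is explicitly linked to one latitude and one longitude
alt = xr.Dataset(
    data_vars={
        "TEMP": (("TIME", "DEPTH"), np.array([[5.6, 5.8], [0.2, np.nan]])),
    },
    coords={
        "TIME": times,
        "DEPTH": [5, 6],
        "LATITUDE": ("TIME", [76.0, 81.46]),
        "LONGITUDE": ("TIME", [31.22, 31.07]),
    },
)

# Selecting one time now carries its position along with it
first = alt.isel(TIME=0)
print(first["LATITUDE"].item(), first["LONGITUDE"].item())
```

With this layout there is no need to assume that the n-th latitude belongs to the n-th time, because the shared dimension makes the link explicit.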

To look at all of the attributes:

In [6]:
data.attrs

{'title': 'Arctic Ocean - In Situ Observation Copernicus',
 'qc_manual': 'Recommendations for in-situ data Near Real Time Quality Control https://doi.org/10.13155/36230',
 'contact': 'cmems-service@imr.no',
 'format_version': '1.4',
 'distribution_statement': 'These data follow Copernicus standards; they are public and free of charge. User assumes all risk for use of data. User must display citation in any publication or product using data. User must contact PI prior to any commercial use of data.',
 'citation': 'These data were collected and made freely available by the Copernicus project and the programs that contribute to it ',
 'naming_authority': 'Copernicus Marine In Situ',
 'data_assembly_center': 'IMR',
 'update_interval': 'void',
 'area': 'Arctic Ocean',
 'author': '',
 'Conventions': 'CF-1.6 Copernicus-InSituTAC-FormatManual-1.42 Copernicus-InSituTAC-SRD-1.5 Copernicus-InSituTAC-ParametersList-3.2.0 ACDD-1.3',
 'data_mode': 'R',
 'comment': '',
 'history': '',
 'references': 

The 'Conventions' attribute is important. It tells us which standards were followed when creating the file. If you are not sure what is meant by 'creator_name', for example, you can look it up in these standards and find a definition for the term.

To look at individual attributes:

In [7]:
data.attrs['Conventions']

'CF-1.6 Copernicus-InSituTAC-FormatManual-1.42 Copernicus-InSituTAC-SRD-1.5 Copernicus-InSituTAC-ParametersList-3.2.0 ACDD-1.3'
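Since the value is a single space-separated string, it can be split to list each convention on its own line. The sketch below copies the string printed above into a plain variable so that it runs without the file; the names `conventions` and `standards` are just for illustration.

```python
# The attribute value printed above, copied in as a plain string
conventions = ('CF-1.6 Copernicus-InSituTAC-FormatManual-1.42 '
               'Copernicus-InSituTAC-SRD-1.5 '
               'Copernicus-InSituTAC-ParametersList-3.2.0 ACDD-1.3')

# The entries are space-separated here, so split() gives one per standard
standards = conventions.split()
for name in standards:
    print(name)
```

With the file loaded, the same thing can be done directly with `data.attrs['Conventions'].split()`.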

To see all the variables:

In [8]:
data.data_vars

Data variables: (12/17)
    TIME_QC      (TIME) float32 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
    POSITION_QC  (POSITION) float32 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
    DIRECTION    (TIME) object b'D' b'D' b'D' b'D' b'D' ... b'D' b'D' b'D' b'D'
    PRES         (TIME, DEPTH) float32 ...
    PRES_QC      (TIME, DEPTH) float32 ...
    TEMP         (TIME, DEPTH) float64 ...
    ...           ...
    TEMP_QC      (TIME, DEPTH) float32 ...
    PSAL_QC      (TIME, DEPTH) float32 ...
    FLU2_QC      (TIME, DEPTH) float32 ...
    CNDC_QC      (TIME, DEPTH) float32 ...
    SVEL_QC      (TIME, DEPTH) float32 ...
    CCOMD003_QC  (TIME, DEPTH) float32 ...

To see an individual data variable:

In [9]:
data['PSAL']

Each variable has its own attributes. The standard_name is the name of the variable taken from a controlled vocabulary, the CF standard name table (this file follows CF-1.6). A definition of the variable can be found by looking up its standard name in that table.

The long_name is provided by the data creator, in their own words. 
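Variable attributes are read in the same way as global attributes, through `.attrs`. The sketch below uses a made-up variable so that it runs without the file; the attribute values shown are assumptions for illustration and may differ from those in the real dataset.

```python
import numpy as np
import xarray as xr

# Toy salinity variable; the attribute values are assumptions for illustration
psal = xr.DataArray(
    np.array([34.878, 34.791]),
    dims=("DEPTH",),
    name="PSAL",
    attrs={
        "standard_name": "sea_water_practical_salinity",
        "long_name": "Practical salinity",
    },
)

# .get() avoids a KeyError if an attribute is missing
print(psal.attrs.get("standard_name"))
print(psal.attrs.get("long_name", "not provided"))
```

On the loaded dataset the equivalent would be `data['PSAL'].attrs.get('standard_name')`.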

# Dumping to Excel file

Some people prefer to work with the data in a format that they're more familiar with. To output as CSV or XLSX, we must first create a dataframe. For individual variables:

In [10]:
df = data['TEMP'].to_dataframe()
df

| TIME | DEPTH | TEMP |
|---|---|---|
| 2021-08-26 16:28:23 | 5 | 5.572 |
| 2021-08-26 16:28:23 | 6 | 5.784 |
| 2021-08-26 16:28:23 | 7 | 5.678 |
| 2021-08-26 16:28:23 | 8 | 5.709 |
| 2021-08-26 16:28:23 | 9 | 5.644 |
| ... | ... | ... |
| 2021-09-22 04:32:56 | 4363 | NaN |
| 2021-09-22 04:32:56 | 4364 | NaN |
| 2021-09-22 04:32:56 | 4365 | NaN |
| 2021-09-22 04:32:56 | 4366 | NaN |


And for multiple variables, as below. Note the double square brackets: the outer pair says 'take something from within my xarray object', and the inner pair says 'this is a list' of variable names.

In [11]:
df = data[['TEMP','PSAL', 'PRES', 'SVEL']].to_dataframe()
df

| DEPTH | TIME | TEMP | PSAL | PRES | SVEL |
|---|---|---|---|---|---|
| 5 | 2021-08-26 16:28:23 | 5.572 | 34.878 | 5.0 | 1472.96 |
| 5 | 2021-08-28 04:10:36 | 0.209 | 36.196 | 5.0 | 1451.77 |
| 5 | 2021-08-28 12:07:38 | 0.493 | 34.791 | 5.0 | 1451.19 |
| 5 | 2021-08-29 00:41:23 | 0.502 | 30.088 | 8.0 | 1445.01 |
| 5 | 2021-08-29 04:42:50 | 0.733 | 34.407 | 5.0 | 1451.76 |
| ... | ... | ... | ... | ... | ... |
| 4367 | 2021-09-17 07:52:22 | NaN | NaN | NaN | NaN |
| 4367 | 2021-09-19 12:20:32 | NaN | NaN | NaN | NaN |
| 4367 | 2021-09-21 04:37:00 | NaN | NaN | NaN | NaN |
| 4367 | 2021-09-21 07:19:29 | NaN | NaN | NaN | NaN |


Now let's write that dataframe out to an XLSX file.

In [12]:
df.to_excel('ctd_data.xlsx')
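Because of the padding described earlier, many rows in the dataframe contain no measurements at all. One option, sketched below, is to drop the rows where every variable is empty before exporting, and to write CSV instead of XLSX if that suits better. The dataframe `toy_df` and its values are made up for illustration; it only mimics the shape of the real one.

```python
import numpy as np
import pandas as pd

# Toy dataframe shaped like the one above (values are illustrative)
index = pd.MultiIndex.from_product(
    [[5, 6], pd.to_datetime(["2021-08-26 16:28:23", "2021-08-28 04:10:36"])],
    names=["DEPTH", "TIME"],
)
toy_df = pd.DataFrame(
    {"TEMP": [5.572, 0.209, 5.784, np.nan],
     "PSAL": [34.878, 36.196, 34.9, np.nan]},
    index=index,
)

# Drop rows where every variable is empty, then export as CSV text
trimmed = toy_df.dropna(how="all")
csv_text = trimmed.to_csv()  # pass a path such as 'ctd_data.csv' to write a file
print(csv_text)
```

Using `how="all"` keeps rows where only some variables are missing, so no real measurement is discarded.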