## EPS/ESE 135: Observing the Ocean
### Data Analysis Assignment 3 (Intro): Mooring data

This week's assignment has 3 goals:
1. introduce the Python [xarray](https://docs.xarray.dev/en/stable/getting-started-guide/why-xarray.html) package. (Here's another reminder to check out the [Project Pythia xarray primer](https://foundations.projectpythia.org/core/xarray/xarray-intro/), which rocks.)
2. practice using the GSW tools and other Python skills from previous assignments.
3. explore a time series of oceanographic data from the North Atlantic.

Like last week, you will first go through the code in this notebook, which will introduce the dataset and Python tools you will be working with. This time you will use the same dataset for the assignment. You will submit the assignment notebook (`03_mooring_assignment.ipynb`) on Canvas, either as an .ipynb file or PDF.

### The dataset

You will be looking at a timeseries from the CF3 mooring, which is part of the [Overturning in the Subpolar North Atlantic Program](https://www.o-snap.org) array. This mooring is located on the continental shelf off the southeast coast of Greenland, with a bottom depth of approximately 180 m. In your first homework assignment, you plotted temperature and pressure records from the deepest sensor on this mooring. The schematic below shows the entire OSNAP array with arrows indicating the large scale ocean circulation in this region, including the southward-flowing East Greenland Current system. Strong ocean currents are sometimes observed at the location of CF3.

![OSNAP array](osnap_array_schematic.jpg)

So far we have worked with data saved as csv (comma-separated value) files. This works well for data sets that vary along one dimension. For example, the mooring timeseries in homework 1 had data recorded at a fixed location and depth, but a range of times. The CTD profiles had data recorded at a single location and time, but a range of depths.

Now we will look at a data set that varies along two dimensions: depth and time.

The data are stored in a NetCDF file, a format that can hold multidimensional data along with descriptive metadata. We will use the Python package xarray to work with this data set.

In [None]:
# import python packages

import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
import gsw

In [None]:
# import netcdf file as xarray data set
cf3 = xr.open_dataset("CF3.nc")

# display summary of data set
cf3

In this initial summary:
* You can see the first few and last few timestamps by clicking on the icon to the far right next to **time**. These show that the time series covers 8 years, from August 2014 to August 2022, and that the interval between successive measurements is 15 minutes.

* You can also see that there are instruments at 3 depths. If you click on the sheet of paper (second icon from the right) next to **depth**, it will display the metadata for that variable including the units of depth, which is meters, i.e. the sensors were deployed at 50 m, 100 m, and 170 m.

* Finally, if you click on the arrow next to `Attributes`, it will display the metadata associated with the overall data set.
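
If you would rather check these details in code than by clicking, the sketch below shows one way to do it. It builds a tiny synthetic dataset so that it runs on its own: the name `demo` and all of its values are made up for illustration and are not taken from `CF3.nc`. The same lookups should work on `cf3`.

```python
import numpy as np
import pandas as pd
import xarray as xr

# synthetic stand-in for CF3.nc (hypothetical values): 15-minute samples at three depths
time = pd.date_range("2014-08-16", periods=8, freq="15min")
demo = xr.Dataset(
    {"temperature": (("depth", "time"), np.full((3, 8), 4.0))},
    coords={
        "time": time,
        "depth": ("depth", [50, 100, 170], {"units": "m"}),
    },
    attrs={"title": "synthetic example, not real CF3 data"},
)

# first few timestamps
first_times = demo["time"].values[:3]

# interval between the first two measurements, converted to minutes
dt_minutes = float(np.diff(demo["time"].values)[0] / np.timedelta64(1, "m"))

# metadata for one coordinate, and for the dataset as a whole
depth_units = demo["depth"].attrs["units"]
global_attrs = demo.attrs

print(first_times)
print(f"sampling interval: {dt_minutes:.0f} minutes")
print(f"depth units: {depth_units}")
print(global_attrs)
```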

### Plotting with xarray

xarray has some simple built-in plotting functions that you can use to visualize the data variables. 

Note that `cf3['temperature']` and `cf3.temperature` are different ways of referring to/accessing the same data variable! You can choose the syntax you prefer.

In [None]:
# make a line plot of the temperature time series for all three depths
cf3['temperature'].plot.line(x='time')

This function automatically uses the metadata to generate a legend and axis labels, which is helpful, but in this case we should probably choose our own more concise labels. We can also make the figure larger so it's not so scrunched.

In [None]:
# make the line plot again, and then set title and y-axis label
cf3['temperature'].plot.line(x='time', figsize = (10,5))
plt.ylabel('in-situ temperature [°C]')
plt.title('CF3 mooring temperature record')
plt.grid()

Here are some features I can see from this figure:
* there appears to be a seasonal cycle in temperature at each of the 3 depths.
* the 50 m instrument (blue line) has the highest highs and the lowest lows; the 170 m temperature (green) appears less variable, with 100 m (orange) somewhere in between.
* the blue line disappears for about one year from 2019-2020.

Now let's plot the pressure time series the same way:

In [None]:
# plot pressure time series and set figure size
cf3['pressure'].plot.line(x='time', figsize=(10,5))

# add labels, title, and grid lines
plt.ylabel('pressure [dbar]')
plt.title('CF3 mooring pressure record')
plt.grid()

There are a few things we should note here. First, look at the green line. This is the pressure record that you plotted in assignment 1. The labeled depth is 170 m. Let's check what the pressure *should* be at 170 m:

In [None]:
z_nom = 170 # nominal depth in meters
lat = 60 # latitude in degrees north

p_nom = gsw.p_from_z(-z_nom,lat) # calculate pressure at given depth using GSW toolbox

# print output as sentence using formatted strings:
# this rounds each value to the given number of decimal places
print(f'The sea pressure at nominal depth {z_nom:.0f} m is {p_nom:.1f} dbar.')

The green line is at over 180 dbar, which means that this instrument was deployed about 10 m deeper than intended. That's okay, but it is something we need to be aware of as we interpret the data.
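
As a rough cross-check that does not rely on GSW, we can use the hydrostatic approximation, pressure ≈ density × gravity × depth. The constant density and gravity below are assumed values chosen for illustration, and the two helper functions are hypothetical names, so treat the result as a ballpark figure rather than a replacement for `gsw.p_from_z`.

```python
# rough hydrostatic check with assumed constants (not the GSW calculation)
rho = 1027.0   # assumed mean seawater density, kg/m^3
g = 9.82       # assumed gravitational acceleration near 60°N, m/s^2


def approx_pressure_dbar(z):
    """Approximate sea pressure in dbar at depth z in meters (positive down)."""
    return rho * g * z / 1e4   # 1 dbar = 10^4 Pa


def approx_depth_m(p):
    """Approximate depth in meters for a sea pressure p in dbar."""
    return p * 1e4 / (rho * g)


p_170 = approx_pressure_dbar(170)
z_180 = approx_depth_m(180)

print(f"approx. {p_170:.1f} dbar at 170 m")
print(f"180 dbar corresponds to approx. {z_180:.1f} m")
```

With these assumed values, one dbar corresponds to roughly 1 m of water, which is why a pressure offset of several dbar translates into a depth offset of several meters.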

Now consider the orange line, which represents data that are supposed to be collected at 100 m. A few things I notice here:
* Every two years, there is a shift in the "baseline" pressure of this line. For example, from 2014-16 the lowest pressure is slightly under 100 dbar. From 2016-18 it increases slightly. From 2018-20, it is a bit higher than 100 dbar. From 2020-22, it is again slightly lower than 100 dbar.
* At times the pressure is much higher than 100 dbar, but only episodically.
* There appears to be a seasonality in these events where the instrument is measuring higher pressures.

The blue line (50 m) is more or less analogous to the orange line. The episodic increases in pressure appear to mirror the orange line, but with a somewhat larger magnitude.

Summarizing these observations, there are three key points to be aware of. 

1. We will refer to the labeled depths (50 m, 100 m, 170 m) as the "nominal" depths -- these are the intended depths of those sensors, but it's not uncommon for them to be slightly off, because the sea floor is not flat!
2. The mooring was redeployed every two years (2014, 2016, 2018, and 2020) and the actual deployment depth shifts slightly each time. For this assignment, this is not an issue, but if we were interpreting long term changes, it would be important to account for any effect of changes in the measurement depth.
3. The episodic increases in observed pressure are what we refer to as "blowdowns." These are events where strong ocean currents push the mooring floats down, which looks like this:

![mooring blowdown from https://proteusds.com/mooring-deflection-data-quality/](mooring-knockdown-excursion-tilt.png)

In this illustration `U` represents the direction of the ocean current. The dashed line represents the neutral position of the mooring. `A` represents the horizontal displacement of the blown-down mooring, and `B` represents the resulting vertical displacement. The blue shading in the background represents layers of increasing density. We can imagine that the instrument next to the letter C is measuring denser water than it would in its neutral position, not necessarily because the water there has become denser but because it has been displaced deeper in the water column.

Notice that these events don't directly affect pressure at the deepest instrument because it is rigidly attached to a tripod (similar to what you saw in class on 9/23). What does show up in the pressure record at the bottom are tides, as you saw in assignment 1.

### Subsets of xarray data sets

For parts of this week's assignment you will be working with a shorter section of the time series. xarray allows you to select subsets by index (using `isel(time=____)`, e.g. the first 2880 timestamps, which is 30 days at 15-minute sampling, with `isel(time=slice(0, 2880))`) or by coordinate value (using `sel(time=____)`, e.g. a span of 30 days specified by its start and end dates).

These can also be used with the depth coordinate. If you wanted to see only the record from the 50 m instrument, you could either use:
* `isel(depth=0)`, because it is the first depth value (index 0)
* `sel(depth=50)`, to directly specify the desired value of the depth coordinate

There are two python functions that simplify this further: 
* `range(maxvalue)` generates a sequence of integers beginning with 0 and up to (but excluding) `maxvalue`. This is good for specifying indices. ([examples](https://www.w3schools.com/python/ref_func_range.asp))
* `slice(minvalue,maxvalue)` is a more general option that allows you to specify other types of beginning and end values, including start and end dates. ([examples](https://www.w3schools.com/python/ref_func_slice.asp))
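
To make the difference between index-based and coordinate-based selection concrete, here is a minimal sketch on a small synthetic dataset. The dataset `demo` and its values are hypothetical and only stand in for the real mooring record.

```python
import numpy as np
import pandas as pd
import xarray as xr

# synthetic stand-in: 3 days of 15-minute samples (96 per day) at three nominal depths
time = pd.date_range("2014-08-16", periods=3 * 96, freq="15min")
demo = xr.Dataset(
    {"pressure": (("depth", "time"), np.zeros((3, time.size)))},
    coords={"depth": [50, 100, 170], "time": time},
)

# by index: the first 96 timestamps (one day), written two equivalent ways
by_range = demo.isel(time=range(96))
by_index_slice = demo.isel(time=slice(0, 96))

# by coordinate value: a date-string slice covers the whole start and end days
by_date = demo.sel(time=slice("2014-08-16", "2014-08-16"))

# the 50 m record, selected by index and by value
shallow_by_index = demo.isel(depth=0)
shallow_by_value = demo.sel(depth=50)

print(by_range.sizes["time"], by_index_slice.sizes["time"], by_date.sizes["time"])
print(shallow_by_index.identical(shallow_by_value))
```

Note that index-based slices exclude the end index, as usual in Python, while coordinate-based slices in `sel` include the end value.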

Here are examples of both approaches:

In [None]:
# select by indices using isel and range
# this is analogous to how you plotted one month of data last time
cf3_1month = cf3.isel(time=range(2880))

# plot pressure variable for this subset of the data
cf3_1month.pressure.plot.line(x='time')

# add title
plt.title('CF3 mooring pressure record, August-September 2014')

In [None]:
# select by time coordinates using sel and slice
# let's choose the first 2-year deployment, from August 16 2014 to August 10 2016
cf3_d1 = cf3.sel(time=slice("2014-08-16","2016-08-10"))

# plot pressure variable for this subset of the data and expand figure size
cf3_d1.pressure.plot.line(x='time',figsize = (10,5))

# add title
plt.title('CF3 mooring pressure record, 2014-16')

### Identifying blowdowns

Blowdown events can bias ocean mooring records. Python has some useful tools for weeding out data points that fall outside of a normal range.

Let's use the record from the nominal 50 m instrument during deployment 1 to work through an example of a simple method for identifying blowdowns. Let's define a blowdown as any time that the observed pressure is greater than the baseline pressure by 10 dbar or more.

We'll use the median pressure as our baseline:

In [None]:
# calculate median pressure for 50 m instrument
cf3_d1.sel(depth=50).pressure.median()

In [None]:
# save this value as a variable
pres50_baseline = float(cf3_d1.sel(depth=50).pressure.median())

# set delta_p, our definition of blowdown for this example
delta_p = 10

# we define the threshold blowdown pressure as baseline + delta_p
pres50_blowdown = pres50_baseline + delta_p

# print the resulting values as a sentence
print(f'The median pressure is {pres50_baseline:.1f} dbar, and the blowdown threshold is {pres50_blowdown:.1f} dbar.')

We can use the greater and less than signs (`>` and `<`) as operators in Python to compare all of the values in the pressure record to this threshold. This returns a "boolean array" which is just a list of True/False values. It effectively tells us which time points in the record are below or above the pressure threshold.
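
Here is a minimal sketch of such a comparison on a handful of made-up pressure values (the numbers and the threshold are hypothetical). It also includes one missing value, to show how NaN behaves in comparisons.

```python
import numpy as np
import xarray as xr

# hypothetical pressure values in dbar; NaN marks a missing measurement
pres = xr.DataArray([48.0, 51.5, 63.2, np.nan, 49.1], dims="time")
threshold = 60.0   # hypothetical blowdown threshold in dbar

below = pres <= threshold   # True where the pressure is at or below the threshold
above = pres > threshold    # True where the pressure exceeds the threshold

print(below.values)
print(above.values)
```

Any comparison with NaN evaluates to False, so a missing measurement is counted neither as in range nor as a blowdown.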

In [None]:
# find measurements with pressure at or below the threshold, i.e. within the normal range
inrange = cf3_d1.sel(depth=50).pressure <= pres50_blowdown

# find measurements with pressure above the threshold, i.e. during a blowdown
notinrange = cf3_d1.sel(depth=50).pressure > pres50_blowdown

# print a preview of the "inrange" array
inrange.values

`True` is assigned a value of 1 and `False` is assigned a value of 0, so you can simply calculate a sum of the `notinrange` array to find out how many of the measurements are associated with a blowdown as we've defined it here.
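
The same idea on a short made-up array of flags (hypothetical values): summing counts the `True` entries, and averaging gives the fraction of entries that are `True`.

```python
import numpy as np
import xarray as xr

# hypothetical True/False flags for eight measurements
flags = xr.DataArray(
    [False, True, True, False, False, True, False, False], dims="time"
)

n_flagged = int(flags.sum())                    # number of True values
frac_flagged = float(flags.mean())              # fraction of True values
n_numpy = int(np.count_nonzero(flags.values))   # equivalent count using NumPy

print(f"{n_flagged} of {flags.size} flagged ({frac_flagged:.1%})")
```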

In [None]:
# calculate the number of blowdown values by summing the True/False array
bd_obs = int(notinrange.sum())

# print the result
print(f'{bd_obs} measurements were taken during a blowdown event.')

We can use this to create a "masked" version of the time series that removes data points associated with a blowdown event.

In [None]:
# where() keeps the data points where the condition is True and replaces the rest with NaN
cf3_d1_masked = cf3_d1.where(inrange)

# plot the masked pressure timeseries for all depths
cf3_d1_masked.pressure.plot.line(x='time')
plt.title('Pressure with blowdown events removed')

While this is a somewhat crude method, you can see that removing the data points where a blowdown was identified at 50 m also removed the large blowdowns at 100 m.
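
To see why a mask built from the 50 m record affects every depth, here is a minimal sketch with made-up numbers (the dataset `demo` and the threshold are hypothetical). The mask only has a time dimension, so `where` applies it to all depths at each time.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2014-08-16", periods=5, freq="15min")
pressure = np.array([
    [50.0, 51.0, 65.0, 50.5, 49.8],        # 50 m: the third sample is a blowdown
    [100.0, 100.5, 112.0, 100.2, 99.9],    # 100 m
    [181.0, 181.2, 181.1, 180.9, 181.0],   # 170 m
])
demo = xr.Dataset(
    {"pressure": (("depth", "time"), pressure)},
    coords={"depth": [50, 100, 170], "time": time},
)

threshold = 60.0   # hypothetical blowdown threshold for the 50 m sensor, in dbar

# the mask depends on time only, because it is built from a single depth
keep = demo["pressure"].sel(depth=50) <= threshold

# where() broadcasts the mask across depth, so flagged times become NaN at every depth
masked = demo.where(keep)

print(masked["pressure"].values)
```

One consequence is that the bottom record is also masked at those times, even though that instrument does not move during a blowdown.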

### Using groupby to calculate monthly statistics

You can also use xarray to easily "group" your data by month and calculate statistics on it. In other words, it will go through the entire record and group data from January of each year, February of each year, etc. This is a very useful tool for calculating climatologies, or monthly averages of properties like temperature and salinity, which you will do in your homework assignment, using the function `[dataset].groupby('time.month').mean()`.
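
As a minimal sketch of a monthly climatology, the example below applies the same `groupby` call to two years of synthetic daily temperatures. The sinusoidal seasonal cycle and all of its parameters are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# two years of daily synthetic temperatures with a made-up seasonal cycle
time = pd.date_range("2015-01-01", "2016-12-31", freq="D")
day_of_year = time.dayofyear.values
temp = 5.0 + 2.0 * np.sin(2 * np.pi * (day_of_year - 120) / 365.25)
demo = xr.Dataset({"temperature": ("time", temp)}, coords={"time": time})

# group all Januaries together, all Februaries together, etc., then average each group
monthly = demo.groupby("time.month").mean()

print(monthly["month"].values)
print(monthly["temperature"].round(2).values)
```

The result has a new `month` dimension with 12 values in place of `time`, so each data variable becomes one mean value per calendar month.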

Here's what it looks like if we calculate the fraction of measurements that are "in range" and "not in range" for each month during 2014-16.

In [None]:
# create a figure with one row and two columns of subplots
# the left subplot is axis[0] and the right subplot is axis[1]
fig, axis = plt.subplots(1,2, figsize=(10,5))

# this is a combination of a bunch of different functions! from left to right:
# notinrange is the boolean array we want to manipulate
# groupby('time.month') tells xarray to group the data in notinrange by month
# mean() then tells it to calculate the average of the True/False values in each group
# plot.line() is the same command we used above, but now we've specified which subplot axis to use
notinrange.groupby('time.month').mean().plot.line(ax=axis[0])

# set title and labels
axis[0].set_title('blowdowns')
axis[0].set_ylabel('fraction of measurements')


# now do the same thing for the inrange data, plotting on the righthand subplot (axis[1])
inrange.groupby('time.month').mean().plot.line(ax=axis[1])

# set title and labels
axis[1].set_title('not blowdowns')
axis[1].set_ylabel('fraction of measurements')


### Other tips

* One thing you may see in your assignment is a "NaN" value, which means "not a number." This can show up if a calculation has no possible solution. Sometimes that means you did something wrong...but sometimes it just means that something is not physically possible. Why might that be...?

* To add a variable to an xarray dataset:

In [None]:
# save (approximate) lon and lat variables
lon = -42
lat = 60

# calculate SA (right side of equal sign) and assign it to a new field in the dataset (left side)
cf3['SA'] = gsw.SA_from_SP(cf3['salinity'], cf3['pressure'], lon, lat)

# display summary of dataset to make sure this worked
cf3
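
Returning to the NaN tip above, here is a minimal sketch of how NaN values arise and how they behave in simple statistics. The values are made up, and this is only meant to show the mechanics, not how you should handle NaNs in the assignment.

```python
import numpy as np
import xarray as xr

# a calculation with no real-valued solution returns NaN
with np.errstate(invalid="ignore"):   # silence the NumPy warning for this demo
    no_solution = np.sqrt(np.float64(-1.0))

# hypothetical temperatures with one missing value
temps = xr.DataArray([4.0, np.nan, 6.0], dims="time")

plain_mean = np.mean(temps.values)      # NumPy's mean propagates the NaN
nan_mean = np.nanmean(temps.values)     # nanmean ignores the NaN
xr_mean = float(temps.mean())           # xarray skips NaN by default for float data
n_missing = int(temps.isnull().sum())   # count the NaN entries

print(no_solution, plain_mean, nan_mean, xr_mean, n_missing)
```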