## Introduction to Python for Physical Oceanography

There are a lot of useful Python tutorials out there, but few focus on the tools and workflows commonly used for data analysis in physical oceanography ([this lecture](https://www.youtube.com/watch?v=ZyCkVI7k3eo) is a good exception). Here we'll go over some of the basic tools and how you might use them to load data, make figures, and do scientific analysis. More resources for learning Python in general can be found [here](https://docs.python.org/3/tutorial/) and, for a more visual approach, [here](https://www.youtube.com/watch?v=rfscVS0vtbw).

### Scientific Computing

Loading the ``numpy`` package is almost always necessary when using Python for scientific computing. It provides essential mathematical functions along with tools for efficiently manipulating large arrays and matrices, including element-wise operations, broadcasting, linear algebra, Fourier transforms, and random number generation. If you're a MATLAB user, you'll notice that many functions native to MATLAB have close equivalents in ``numpy``.

In [None]:
import numpy as np
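
To make "element-wise operations" and "broadcasting" a little more concrete, here is a minimal sketch. The temperature values are made up for illustration and are not workshop data.

```python
import numpy as np

# Hypothetical data: 3 temperature profiles (rows) sampled at 4 depths (columns), in °C
temps = np.array([
    [28.1, 27.5, 24.0, 19.2],
    [28.4, 27.9, 23.1, 18.7],
    [27.9, 27.2, 24.6, 19.9],
])

# Element-wise operation: convert every value to Kelvin in one line, no loop needed
temps_kelvin = temps + 273.15

# Reduce along the profile axis (axis=0) to get the mean temperature at each depth
mean_profile = temps.mean(axis=0)   # shape (4,)

# Broadcasting: the (4,) mean profile is subtracted from every row of the (3, 4) array
anomalies = temps - mean_profile    # shape (3, 4)

print(anomalies.shape)
print(np.allclose(anomalies.mean(axis=0), 0.0))
```

The same pattern (reduce along one axis, then subtract the result from the full array) comes up constantly when computing anomalies relative to a mean profile.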

In physical oceanography, we're often looking at processes that span large enough distances that we must account for the Earth's curvature. The haversine formula calculates the great-circle distance between two points on a sphere:

$$
a = \sin^2\left(\frac{\Delta\phi}{2}\right) + \cos(\phi_1)\cos(\phi_2)\sin^2\left(\frac{\Delta\lambda}{2}\right)
$$

$$
c = 2 \arcsin\left( \sqrt{a} \right)
$$

$$
d = R \cdot c
$$

where:

- $\phi_1, \phi_2$ are the latitudes in radians  
- $\Delta\phi = \phi_2 - \phi_1$ is the difference in latitude  
- $\Delta\lambda = \lambda_2 - \lambda_1$ is the difference in longitude  
- $R$ is Earth’s radius (≈ 6367 km)

Here's how you would implement this formula with a Python function:


In [None]:
def haversine_np(coord1, coord2):
    """
    Calculates the great-circle distance between two points 
    on the earth (specified in decimal degrees), given as (lat, lon) tuples.

    Args:
        coord1: tuple (lat1, lon1)
        coord2: tuple (lat2, lon2)

    Returns:
        Distance in kilometers.
    """
    lat1, lon1 = np.radians(coord1)
    lat2, lon2 = np.radians(coord2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    
    km = 6367 * c  # Earth's radius in km
    return km

In [51]:
# [LIVE CODE]: calculate the great-circle distance between Accra, Ghana and Woods Hole, MA, USA
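
One possible way to approach this exercise is sketched below. The function is a compact copy of the haversine formula so that the snippet runs on its own, and the coordinates are approximate values that we are assuming for the two locations (decimal degrees, with west longitudes negative).

```python
import numpy as np

def haversine_km(coord1, coord2, radius_km=6367.0):
    """Great-circle distance in km between (lat, lon) points given in decimal degrees."""
    lat1, lon1 = np.radians(coord1)
    lat2, lon2 = np.radians(coord2)
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return radius_km * 2 * np.arcsin(np.sqrt(a))

# Approximate coordinates (assumed values for illustration)
accra = (5.60, -0.19)
woods_hole = (41.53, -70.67)

distance_km = haversine_km(accra, woods_hole)
print(f"Accra to Woods Hole: {distance_km:.0f} km")
```

A quick sanity check is to compare against a case you can work out by hand: two points on the equator separated by 90° of longitude should be a quarter of the way around the sphere, i.e. $R\pi/2$.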

### Loading Data

In physical oceanography, we're often working with large datasets that are *geo-referenced*: arrays of physical variables like temperature and salinity in which each measurement has a latitude and longitude coordinate. Data structured this way is often packaged in "NetCDF" format (file extension ".nc") and can be read by several Python packages. One of the most commonly used is **[xarray](https://xarray.dev/)**, which is a great tool for manipulating data stored in multidimensional arrays.

In [None]:
import xarray as xr

The data we're going to use in this workshop is hosted online at a series of URLs. Data can be directly downloaded from online servers using commands like ``wget``, which is a simple way to fetch files from the web using the command line. For example:
```bash
wget https://example.com/path/to/file.nc
```
This would download the file to your working directory. In this workshop, instead of downloading the files we will simply *stream* the data. The URLs for all the NetCDF files are stored in the attached JSON file. (This is only slightly fancier than a plain text file in that it lets us group like files together; a simple text file of URLs would also work fine.) JSON files can be read with the following code:

In [None]:
import json

with open("./data/data_manifest_January.json", "r") as f:
    manifest = json.load(f)
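
If you haven't worked with JSON before, it may help to see what a manifest like this could look like. The sketch below builds a hypothetical one (the keys other than `ctd` and all of the `example.com` URLs are placeholders, not the workshop's real entries), writes it to a temporary file, and reads it back the same way as above.

```python
import json
import os
import tempfile

# Hypothetical manifest: each key is a data type, each value is a list of NetCDF URLs
example_manifest = {
    "ctd": [
        "https://example.com/data/ctd_profile_01.nc",
        "https://example.com/data/ctd_profile_02.nc",
    ],
    "oxygen": [
        "https://example.com/data/oxygen_profile_01.nc",
    ],
}

# Write the manifest to a temporary JSON file, then load it back into a dictionary
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example_manifest.json")
    with open(path, "w") as f:
        json.dump(example_manifest, f, indent=2)
    with open(path, "r") as f:
        loaded = json.load(f)

# .get() with a default avoids a KeyError when a data type is missing from the manifest
for data_type in ["ctd", "oxygen", "adcp"]:
    urls = loaded.get(data_type, [])
    print(f"{data_type}: {len(urls)} file(s)")
```

Once loaded, the manifest is just a Python dictionary of lists, so grouping files by data type costs nothing extra.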

We can then loop through the URLs in the manifest and open them with **xarray**.

In [None]:
import fsspec  # opens each remote file as a file-like object that xarray can read

# choose data type and optional density contour flag
data_type = 'ctd'
density_contour = False

# load main datasets
datasets = []
for url in manifest.get(data_type, []):
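    with fsspec.open(url, mode='rb') as f:
        # decode_timedelta=True suppresses a decoding warning; .load() reads the data
        # into memory before the remote file closes at the end of the with block
        ds = xr.open_dataset(f, decode_timedelta=True).load()
        datasets.append(ds)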

# load optional CTD datasets for density contour
datasets_ctd = []
if density_contour and data_type != 'ctd':
    for url in manifest.get('ctd', []):
        with fsspec.open(url, mode='rb') as f:
            # .load() reads the data into memory before the remote file closes
            ds = xr.open_dataset(f, decode_timedelta=True).load()
            datasets_ctd.append(ds)


In [None]:
# [LIVE CODE]: explore the datasets

### Visualizing Data

To visualize this data, we can use **matplotlib.pyplot**.

In [None]:
import matplotlib.pyplot as plt

**xarray** does have a built-in plotting method that is useful for visualizing data quickly, but notice that it doesn't give us much flexibility.

In [None]:
datasets[0].CT.plot(label='Temperature')

In [None]:
# [LIVE CODE]: plot a single profile of temperature, salinity, chlorophyll, and oxygen using matplotlib

### Geophysical Data Analysis

In physical oceanography, we're often interested in tracking the density of the water, since it can tell us a lot about the dynamics. Density can be calculated from readily measured seawater properties like temperature and salinity, since both affect it: saltier water is denser than fresher water, and colder water is denser than warmer water. But what if seawater is warm but salty, or cold but fresh?

The relative contributions of salinity and temperature to seawater density are modeled by an **equation of state**, the simplest of which is the linear equation:

$$
\Delta \rho = \rho_0 (- \alpha \Delta T + \beta \Delta S)
$$

in which:
- $\rho$ is the seawater density (kg/m³)
- $\rho_0$ is a reference density (often around 1025 kg/m³)
- $\alpha$ is the **thermal expansion coefficient** (1/°C), which is positive for typical seawater
- $\beta$ is the **haline contraction coefficient** (1/psu), also positive
- $T$ is temperature (°C)
- $S$ is salinity (psu)
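
The linear equation can be used to answer the "warm but salty" question directly. Here is a minimal sketch; the default values of $\alpha$ and $\beta$ are representative numbers that we are assuming for illustration, and in the real ocean both vary with temperature, salinity, and pressure.

```python
import numpy as np

def linear_density_change(delta_T, delta_S, rho0=1025.0, alpha=2.0e-4, beta=7.6e-4):
    """
    Linear equation of state: density change (kg/m^3) for a temperature change
    delta_T (°C) and a salinity change delta_S (psu) relative to a reference state.
    The default alpha and beta are assumed, representative values.
    """
    return rho0 * (-alpha * np.asarray(delta_T) + beta * np.asarray(delta_S))

# Warming alone makes the water lighter; adding salt alone makes it denser
print(linear_density_change(2.0, 0.0))
print(linear_density_change(0.0, 1.0))

# Warm *and* salty: the sign of the result tells us which effect wins
print(linear_density_change(2.0, 1.0))
```

With these assumed coefficients, a 1 psu increase in salinity outweighs a 2 °C warming, so the warm, salty water still ends up denser than the reference.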


The international standard is a nonlinear polynomial with *many* terms and is implemented in full by the [Gibbs SeaWater (GSW) Toolbox of TEOS-10](https://teos-10.org/pubs/gsw/html/gsw_contents.html). We can use the [Python implementation](https://teos-10.github.io/GSW-Python/intro.html) here.

In [None]:
import gsw
# [LIVE CODE]: calculate potential density from conservative temperature and absolute salinity. 
# plot density
# plot N^2

Here's an example that combines all the tools we've used so far: basic scientific computing, oceanographic analysis, and data visualization. It loops through each of the NetCDF files we've loaded, interpolates every profile onto a common grid, and uses the gsw toolbox to calculate the potential density from the conservative temperature and absolute salinity. It then takes the mean and standard deviation across all profiles and plots them in three subplots.

We have to import a custom function from `utils.py` that handles interpolation for data with NaNs ("Not-a-Number"s) since the code will otherwise throw an error. 
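
We don't show the contents of `utils.py` here, but to give a sense of what a helper like this might do, here is one plausible sketch built on `np.interp`. The name `interpolate_nans_sketch` is hypothetical, and the actual function in `utils.py` may be implemented differently.

```python
import numpy as np

def interpolate_nans_sketch(values, x):
    """
    Fill NaNs in `values` by linear interpolation along `x`.
    Hypothetical stand-in for the helper in utils.py.
    """
    values = np.asarray(values, dtype=float)
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(values)
    if not valid.any():
        return values.copy()  # nothing to interpolate from
    # np.interp holds the first/last valid value constant beyond the valid range
    return np.interp(x, x[valid], values[valid])

profile = np.array([np.nan, 20.0, np.nan, 22.0, 23.0, np.nan])
filled = interpolate_nans_sketch(profile, np.arange(len(profile)))
print(filled)
```

Note that in this sketch NaNs at the ends of the profile are filled with the nearest valid value rather than extrapolated, which is a design choice you may or may not want for your own data.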

In [None]:
from utils import interpolate_nans  # custom helper that fills NaNs by interpolation

max_depth = 80
depth_grid = np.linspace(10, max_depth, 50)  # define a common depth grid

# initialize arrays to store interpolated values
CT_interp = np.full((len(datasets), len(depth_grid)), np.nan)
SA_interp = np.full((len(datasets), len(depth_grid)), np.nan)
sigma_interp = np.full((len(datasets), len(depth_grid)), np.nan)

# interpolate each profile onto the common depth grid
for i, ds in enumerate(datasets):
    p = ds["P"].values
    ct = ds["CT"].values
    sa = ds["SA"].values
    
    p = interpolate_nans(p, np.arange(len(p)))
    ct = interpolate_nans(ct, np.arange(len(ct)))
    sa = interpolate_nans(sa, np.arange(len(sa)))
    
    # interpolate to the common depth grid
    CT_interp[i, :] = np.interp(depth_grid, p, ct, left=np.nan, right=np.nan)
    SA_interp[i, :] = np.interp(depth_grid, p, sa, left=np.nan, right=np.nan)

    # compute potential density for each profile
    sigma_interp[i, :] = gsw.density.sigma0(SA_interp[i, :], CT_interp[i, :])

# compute the mean profile and std
CT_mean = np.nanmean(CT_interp, axis=0)
CT_std  = np.nanstd(CT_interp, axis=0)

SA_mean = np.nanmean(SA_interp, axis=0)
SA_std  = np.nanstd(SA_interp, axis=0)

sigma_mean = np.nanmean(sigma_interp, axis=0)
sigma_std  = np.nanstd(sigma_interp, axis=0)


# create figure axes
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(7, 6))

# plot mean profiles with variance visualized as shaded areas
ax1.plot(CT_mean, depth_grid, 'b', linewidth=2, label="Mean")
ax1.fill_betweenx(depth_grid, CT_mean - CT_std, CT_mean + CT_std, color='b', alpha=0.2, linewidth=0)

ax2.plot(SA_mean, depth_grid, 'r', linewidth=2, label="Mean")
ax2.fill_betweenx(depth_grid, SA_mean - SA_std, SA_mean + SA_std, color='r', alpha=0.2, linewidth=0)

ax3.plot(sigma_mean, depth_grid, 'purple', linewidth=2, label="Mean")
ax3.fill_betweenx(depth_grid, sigma_mean - sigma_std, sigma_mean + sigma_std, color='purple', alpha=0.2, linewidth=0)


# formatting axes
ax1.set_ylabel('Pressure (dbar)') # 'Pressure (dbar)', 'Depth (m)'
ax1.set_xlabel('Conservative Temperature (°C)')
ax1.set_ylim(12, max_depth)
ax1.set_xlim(15, 30)

ax2.set_xlabel('Absolute Salinity (g/kg)')
ax2.set_yticks([])
ax2.set_ylim(12, max_depth)
ax2.set_xlim(34.5, 36.5)

ax3.set_xlabel(r'$\sigma_0$ (kg/m$^3$)')  # sigma0 is the potential density anomaly (density minus 1000 kg/m^3)
ax3.set_yticks([])
ax3.set_ylim(12, max_depth)
ax3.set_xlim(21, 28)

ax1.invert_yaxis()
ax2.invert_yaxis()
ax3.invert_yaxis()

fig.suptitle("January Sakumono CTD Profiles (Average and Std Dev)", fontsize=14)
plt.tight_layout()
plt.show()

Instead of looking at all of our data as a function of depth, sometimes it's useful to look at the data in "temperature-salinity (T-S) space". In this case, our $x$-axis is salinity, our $y$-axis is temperature, and each data point is plotted with respect to both of those properties. This can help us learn about the source of our water, particularly since deep ocean water is formed in very specific places on Earth.

Here's an example from some Argo float data (float 7580/WMO\#3902237) collected off the coast of South America! 

![T-S_diagram](./figures/t-s-diagram.png)

Let's try to make a T-S diagram for the data we've loaded. As in the previous figure, we'll need to create our figure axes and then loop through each dataset to collect the temperature and salinity. There are many different ways to do it!

Some useful functions are [``plt.scatter``](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) and [``plt.contour``](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contour.html). 

In [None]:
# make a T-S diagram!
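
If you'd like a starting point, here is a rough sketch of one of the many ways to do it. It uses synthetic, made-up data rather than the workshop datasets, and it draws the background density contours with the linear equation of state (with assumed reference values and coefficients) so that it runs without any extra packages; swapping in `gsw.sigma0` would give more accurate contours.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ts_diagram(salinity, temperature, rho0=1025.0, T0=20.0, S0=35.0,
                    alpha=2.0e-4, beta=7.6e-4):
    """Scatter temperature against salinity over contours of linearized density."""
    fig, ax = plt.subplots(figsize=(5, 5))

    # Background grid spanning the data, used only to draw the density contours
    s_grid = np.linspace(np.nanmin(salinity) - 0.2, np.nanmax(salinity) + 0.2, 50)
    t_grid = np.linspace(np.nanmin(temperature) - 1.0, np.nanmax(temperature) + 1.0, 50)
    S, T = np.meshgrid(s_grid, t_grid)

    # Linear equation of state with assumed reference state and coefficients
    rho = rho0 * (1.0 - alpha * (T - T0) + beta * (S - S0))
    contours = ax.contour(S, T, rho - 1000.0, colors="gray", linewidths=0.8)
    ax.clabel(contours, fmt="%.1f", fontsize=8)

    # Each measurement becomes one point in T-S space
    ax.scatter(salinity, temperature, s=10, c="k")
    ax.set_xlabel("Salinity")
    ax.set_ylabel("Temperature (°C)")
    return fig, ax

# Synthetic profile (made-up values): warm, fresher surface water over cooler, saltier water
temperature = np.linspace(28.0, 16.0, 40)
salinity = np.linspace(34.8, 36.0, 40)
fig, ax = plot_ts_diagram(salinity, temperature)
```

For the real data, you could concatenate the `CT` and `SA` values from every dataset in `datasets` and pass those arrays in instead of the synthetic ones.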