*30 Aug 2025, Julian Mak (whatever with copyright, do what you want with this)*

### As part of material for OCES 3301 "Data Analysis in Ocean Sciences" delivered at HKUST

For the latest version of the material, go to the public facing [GitHub](https://github.com/julianmak/academic-notes/tree/master/OCES3301_data_analysis_ocean) page.

### General spiel about assessment

***Your hand-in should be in the form of a Jupyter notebook and associated files (if any), and no other form of hand-in will be accepted***. The use of Jupyter notebook and its Python component is part of the assessment criteria for the *presentation* and *coding* portion. Hand these in through Canvas in the usual way. You are graded on the following attributes:

1) **scientific content** (40%)

2) **writing, presentation and referencing** (30%)

3) **use of Jupyter and/or Python coding** (30%)

4) **originality** (10%; analysis beyond scope of course, use of memes and puns; surprise me)

See the sample assignments I've made for the kind of things we might be expecting. We will probably be fairly loose with giving credit, but 60% or below would count as unsatisfactory (85% or above would be an A grade I would imagine).

You are allowed to use other Python packages if you find them, but see point b) below.

a) ***Late assignments get a penalty of 1% of full marks per minute*** (so don't bother handing in anything after 2 hours). We will still mark it and give feedback, but you just don't get the credit. Excuses could be entertained but you will need sufficient evidence to back this up (e.g. your internet went down in the area and you have some pictorial/written demonstration for this).

b) ***Your code needs to be able to run from scratch at least in the standard Google Colab***, otherwise you will get no marks from the 3rd attribute, and probably next to nothing in the 1st attribute (because your graphs probably won't be generated). When you hand the notebook in, you should pass it through `Kernel -> Restart & Clear Output`, so the file is reasonably sized and only full of text (and if you don't *you get a 10% penalty* for not following instructions, for reasons in point c) below). The procedure here is that we will run the whole notebook from scratch probably on [Google Colab](https://colab.research.google.com), then mark the resulting outputs. **So make sure you test your code through Google Colab at least!** (or do your assignments on there, find whatever workflow works for you).

c) ***Plagiarism***: By all means consult each other and/or work together, but the files you hand in should be done and written up separately. To allow checks in Turnitin, you should pass the notebook through `Kernel -> Restart & Clear Output` before you hand it in. **The default for anyone accused of plagiarism is ZERO on the assignment**, and depending on whether you decide to contest and the result of the appeal, this could possibly lead to an official note of plagiarism on your transcript (I will allow people to argue but one should be ready for the consequences).

A few things count as plagiarism:

**Copying between students, and the default is that ALL parties involved get zero for the assignment**, regardless of whether one side can demonstrate they were the one copied from (extra incentive to keep the writing separate).

**Copying text without citation is plagiarism**. Use quotation marks and give a reference if you are directly lifting text, but don't do this too often (it will result in the text looking cluttered, and in not getting full credit for the *presentation* aspect).

**Code is a slightly more grey area**, but I will just say no one has ever really been punished for being cautious and generous with citations, but make sure you present it well (e.g. overburdening text with citations will make the presentation ugly, and will not get full credit for the *presentation* aspect say).

I will just make the point that we don't tend to accuse anyone of plagiarism unless we have enough proof, and if we are doing it it probably means we think we have a sufficiently strong case that is probably not worth arguing against (because the penalty then gets increased).

---------------------------
# Assessment 1 (20% of total course grade)

For this assignment the data you are given is surface data from Argo, in the file `argo_data.csv` (if you are interested, I just took a randomly selected portion of this from the big `zarr` file in the `bonus_machine_learning.ipynb` notebook). Your goal here is to do some descriptive analysis of the data and explore it through plotting etc. The format of this data is quite similar to `penguins.csv`. Further, a routine to calculate the density using a form of the TEOS-10 equation of state is also provided as a subroutine below (I just copied and pasted it) to do some further analysis with.

For the handed in assignment you are supposed to:

a) Look up and describe with appropriate references what Argo data and the TEOS-10 equation of state are [*be able to do some background research*] (hint: some of this stuff is in the above notebook, and it's something I go through in the OCES 2003 course)

b) Describe and provide rationalisations for the summary statistics of the data depending on the basin label [*practise and demonstrate understanding of Python code and plotting*]

c) Write some of these things up and describe them using the Markdown cells [*practise and demonstrate understanding of Jupyter notebooks*]

d) Any others that could fall under originality (memes welcome, references to Miffy even better, but scientific content should always come first)

Some possibly useful basic code is provided below; use as much or as little of it as you like.

**Possible things to investigate and do** (you don't need to do all of this to get full marks):

* Zeroth order things: describe what is in the data file (how many entries? what are the variables? units?)
* Describe the summary statistics (e.g. mean, std, correlations etc.) and their variation depending on basin label.
* Plot the data in the $TS$-diagram labelled by basin.
* Compute their densities according to the TEOS-10 routine and describe the summary statistics of those.
* Plot the data in the $TS$-diagram labelled by basin and under/overlay the TEOS-10 density contours in the same plot to show the relevant density information.
* [Further computation thing] I loaded the data as pandas dataframes but manipulations are done with numpy arrays, but you could try doing things directly with pandas objects instead.
* [Further study probably] Is one variable more important than the other in contributing to the resulting density? How does this importance vary? Can you quantify this? (You could try leveraging the `gsw` package, but then you need to make sure you write in the appropriate code to load these on the fly in Colab)
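
As a rough illustration of the "directly with pandas" route mentioned in the list above, here is a minimal sketch of per-label summary statistics using `groupby`. The data is synthetic and the basin labels `basin_A` and `basin_B` are made up; only the column names are borrowed from the sample code in this notebook, so treat it as a starting point rather than the expected answer.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the real data; only the column names are taken
# from the sample code in this notebook, the values are made up
rng = np.random.default_rng(seed=0)
n = 200
toy = pd.DataFrame({
    "basin": rng.choice(["basin_A", "basin_B"], size=n),  # hypothetical labels
    "temperature_degC": rng.normal(15.0, 5.0, size=n),
    "salinity_g/kg": rng.normal(35.0, 1.0, size=n),
})

cols = ["temperature_degC", "salinity_g/kg"]

# per-label count, mean, standard deviation, min and max in one table
summary = toy.groupby("basin")[cols].agg(["count", "mean", "std", "min", "max"])
print(summary)

# per-label correlation between temperature and salinity
corr = {name: grp[cols[0]].corr(grp[cols[1]]) for name, grp in toy.groupby("basin")}
print(corr)
```

The same pattern should carry over to the real dataframe by replacing `toy` with `df`, assuming the columns are named as in the sample code.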

<u>Note</u>: For the present purpose, you are allowed to cite Wikipedia, but you probably also want to not have Wikipedia be your only source.
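
If you do end up using an extra package such as `gsw`, one possible way to make the notebook still run from scratch on Colab is to install the package only when the import fails. The helper name `ensure_package` below is made up for illustration, and the sketch assumes the pip name and the import name are the same (which is the case for `gsw`). It is demonstrated on the standard library module `math` so that nothing is downloaded.

```python
import importlib
import subprocess
import sys


def ensure_package(name):
    """Import `name`, installing it with pip first if it is missing.

    Hypothetical helper: assumes the pip name equals the import name.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        # use the same Python interpreter that is running the notebook
        subprocess.check_call([sys.executable, "-m", "pip", "install", name])
        importlib.invalidate_caches()
        return importlib.import_module(name)


# demonstration on a module that is always available (no install happens)
math_mod = ensure_package("math")
print(math_mod.sqrt(4.0))

# for the real use case (needs an internet connection if gsw is missing):
# gsw = ensure_package("gsw")
```

In a notebook cell, a plain `!pip install gsw` line before the import is a common alternative, at the cost of re-running pip every time.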

***You should name your notebook "ass1_argo_STUDENTID.ipynb" when you hand the notebook in through Canvas***. When you hand in the notebook, make sure to delete all the cells above and including this one. Failure to do so may result in anything up to a ***5% deduction***, and this is ***on top of whatever deductions we may have made above for code not working*** under the **use of Jupyter and/or Python coding** category.

---

In [None]:
# sample code: load the packages and define the density routine

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# (slightly hacked) routine for computing TEOS-10 density
#   [I've forced p=0 because this is surface argo data we are dealing with]
def sigmai_dep(CT, SA, p=0):
    """
    Compute the in-situ (or potential) density from CONSERVATIVE Temperature, 
    ABSOLUTE Salinity and pressure fields using the TEOS-10 EOS

    p can be a field (computed with subroutine p_from_z say) to get in-situ 
    density or can be a number p = p_ref, then it is a potential density 
    referenced to p_ref

    Adapted from the MATLAB GSW toolbox (http://www.TEOS-10.org)

    Inputs:
    CT             = Conservative Temperature      t        deg celsius
    SA             = Absolute Salinity             s        g / kg
    p              = (reference) pressure          p        dbar

    Returns:
    sigmai_dep_out = (potential) density - 1000    rho - 1000    kg / m^3

    """
    # ensures that SA is non-negative.
    SA = abs(SA)

    # deltaS = 24
    sfac = 0.0248826675584615                 # sfac   = 1/(40*(35.16504/35)).
    offset = 5.971840214030754e-1             # offset = deltaS*sfac.

    x2 = sfac * SA
    xs = np.sqrt(x2 + offset)
    ys = CT * 0.025
    z  = p * 1e-4

    v000 =  1.0769995862e-3
    v001 = -6.0799143809e-5
    v002 =  9.9856169219e-6
    v003 = -1.1309361437e-6
    v004 =  1.0531153080e-7
    v005 = -1.2647261286e-8
    v006 =  1.9613503930e-9
    v010 = -1.5649734675e-5
    v011 =  1.8505765429e-5
    v012 = -1.1736386731e-6
    v013 = -3.6527006553e-7
    v014 =  3.1454099902e-7
    v020 =  2.7762106484e-5
    v021 = -1.1716606853e-5
    v022 =  2.1305028740e-6
    v023 =  2.8695905159e-7
    v030 = -1.6521159259e-5
    v031 =  7.9279656173e-6
    v032 = -4.6132540037e-7
    v040 =  6.9111322702e-6
    v041 = -3.4102187482e-6
    v042 = -6.3352916514e-8
    v050 = -8.0539615540e-7
    v051 =  5.0736766814e-7
    v060 =  2.0543094268e-7
    v100 = -3.1038981976e-4
    v101 =  2.4262468747e-5
    v102 = -5.8484432984e-7
    v103 =  3.6310188515e-7
    v104 = -1.1147125423e-7
    v110 =  3.5009599764e-5
    v111 = -9.5677088156e-6
    v112 = -5.5699154557e-6
    v113 = -2.7295696237e-7
    v120 = -3.7435842344e-5
    v121 = -2.3678308361e-7
    v122 =  3.9137387080e-7
    v130 =  2.4141479483e-5
    v131 = -3.4558773655e-6
    v132 =  7.7618888092e-9
    v140 = -8.7595873154e-6
    v141 =  1.2956717783e-6
    v150 = -3.3052758900e-7
    v200 =  6.6928067038e-4
    v201 = -3.4792460974e-5
    v202 = -4.8122251597e-6
    v203 =  1.6746303780e-8
    v210 = -4.3592678561e-5
    v211 =  1.1100834765e-5
    v212 =  5.4620748834e-6
    v220 =  3.5907822760e-5
    v221 =  2.9283346295e-6
    v222 = -6.5731104067e-7
    v230 = -1.4353633048e-5
    v231 =  3.1655306078e-7
    v240 =  4.3703680598e-6
    v300 = -8.5047933937e-4
    v301 =  3.7470777305e-5
    v302 =  4.9263106998e-6
    v310 =  3.4532461828e-5
    v311 = -9.8447117844e-6
    v312 = -1.3544185627e-6
    v320 = -1.8698584187e-5
    v321 = -4.8826139200e-7
    v330 =  2.2863324556e-6
    v400 =  5.8086069943e-4
    v401 = -1.7322218612e-5
    v402 = -1.7811974727e-6
    v410 = -1.1959409788e-5
    v411 =  2.5909225260e-6
    v420 =  3.8595339244e-6
    v500 = -2.1092370507e-4
    v501 =  3.0927427253e-6
    v510 =  1.3864594581e-6
    v600 =  3.1932457305e-5

    v = v000 + ( 
        xs * (v100 + xs * (v200 + xs * (v300 + xs * (v400 + xs * (v500 
      + v600 * xs))))) + ys * (v010 + xs * (v110 + xs * (v210 + xs * (v310 
      + xs * (v410 + v510 * xs)))) + ys * (v020 + xs * (v120 + xs * (v220 
      + xs * (v320 + v420 * xs))) + ys * (v030 + xs * (v130 + xs * (v230 
      + v330 * xs)) + ys * (v040 + xs * (v140 + v240*xs) + ys * (v050 
      + v150 * xs + v060 * ys))))) + z * (v001 + xs * (v101 + xs * (v201 
      + xs * (v301 + xs * (v401 + v501 * xs)))) + ys * (v011 + xs * (v111
      + xs * (v211 + xs * (v311 + v411 * xs))) + ys * (v021 + xs * (v121 
      + xs * (v221 + v321 * xs)) + ys * (v031 + xs * (v131 + v231 * xs) 
      + ys * (v041 + v141 * xs + v051 * ys)))) + z * (v002 + xs * (v102 
      + xs * (v202 + xs * (v302 + v402 * xs))) + ys * (v012 + xs * (v112 
      + xs * (v212 + v312 * xs)) + ys * (v022 + xs * (v122 + v222 * xs) 
      + ys * (v032 + v132 * xs + v042 * ys))) + z * (v003 + xs * (v103 
      + v203 * xs) + ys * (v013 + v113 * xs + v023 * ys) + z * (v004 
      + v104 * xs + v014 * ys + z * (v005 + v006 * z)))))
              )

    sigmai_dep_out = (1 / v) - 1000

    return sigmai_dep_out

In [None]:
# read the data
option = "remote"

if option == "local":
    print("loading data locally (assumes file has already been downloaded)")
    path = "argo_data.csv"
elif option == "remote":
    print("loading data remotely")
    path = "https://raw.githubusercontent.com/julianmak/OCES3301_data_analysis/refs/heads/main/argo_data.csv"
else:
    raise ValueError("INVALID OPTION: use 'remote' or 'local'")

df = pd.read_csv(path)
df

In [None]:
# subsetting data out by basin label, and loading these into numpy arrays
data_NA = df[df["basin"] == "N_Atlantic"]
TS_NA = data_NA[["temperature_degC", "salinity_g/kg"]].values
TS_NA

In [None]:
# density calculator: sigmai_dep(temperature, salinity)
dens_NA = sigmai_dep(TS_NA[:, 0], TS_NA[:, 1])
dens_NA
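
Since `sigmai_dep` only uses elementwise array operations, it can also be evaluated on a 2d grid of temperature and salinity, which is one way to get density contours for a $TS$-diagram. Below is a minimal sketch of that idea. To keep it self-contained it uses a made-up linear equation of state `toy_sigma` (NOT TEOS-10, coefficients are illustrative only) and random points instead of the Argo data; swapping in `sigmai_dep` and the real arrays is left to you.

```python
import numpy as np
import matplotlib.pyplot as plt


def toy_sigma(T, S, rho0=1027.0, T0=10.0, S0=35.0, alpha=2e-4, beta=8e-4):
    """Linear stand-in equation of state, NOT TEOS-10 (illustrative values)."""
    return rho0 * (1.0 - alpha * (T - T0) + beta * (S - S0)) - 1000.0


# made-up scatter points standing in for temperature and salinity data
rng = np.random.default_rng(seed=1)
T_pts = rng.uniform(0.0, 30.0, size=100)
S_pts = rng.uniform(33.0, 37.0, size=100)

# regular grid covering the data range; meshgrid returns 2d arrays
T_vec = np.linspace(T_pts.min(), T_pts.max(), 50)
S_vec = np.linspace(S_pts.min(), S_pts.max(), 60)
S_grid, T_grid = np.meshgrid(S_vec, T_vec)
sigma_grid = toy_sigma(T_grid, S_grid)  # shape (len(T_vec), len(S_vec))

# contours of density underneath, data points on top
fig, ax = plt.subplots()
cs = ax.contour(S_grid, T_grid, sigma_grid, colors="k", alpha=0.5)
ax.clabel(cs, fmt="%.1f")
ax.scatter(S_pts, T_pts, s=10, label="toy data")
ax.set_xlabel(r"salinity (g kg$^{-1}$)")
ax.set_ylabel(r"temperature ($^\circ$C)")
ax.legend()
```

Putting salinity on the horizontal axis and temperature on the vertical axis is the usual convention for $TS$-diagrams, but the choice is yours.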