*05 Aug 2025, Julian Mak (whatever with copyright, do what you want with this)*

### As part of material for OCES 5303 "AI and Machine Learning in Ocean Science" delivered at HKUST

For the latest version of the material, go to the public-facing [GitHub](https://github.com/julianmak/OCES5303_ML_ocean) page.

---------------------------
# Some notes about the bacteria data

For this you need the file `PIDweekly_env_data.txt`, which contains some data supplied by [Charmaine Yung](https://www.charmaineyung.com/) from one of the papers she was on ([this one](https://www.nature.com/articles/ismej20174), have a quick read for some context). The [book by Jorge Sarmiento and Niki Gruber](https://lbdiscover.ust.hk/bib/991012846667203412) (online access available for HKUST members, and there are also copies available in the library) is probably also an excellent reference for the ocean biogeochemistry variables that will show up here.

The pandas code below reads the data and drops some variables, namely:
* `SampleID`, which is a tag for the collected sample.
* `Projected_Daily_Insolation`, which is the solar heating (directly related to the seasons and the water temperature).
* `MLLW`, presumably mean lower low water, the tidal datum defined as the average height of the lower of the two low tides each day.

There are on the order of 10 variables in the dataset, and you don't strictly need to look at all of them. You could, for example, focus on the physical variables (temperature, salinity, pH); or, if you think the chemistry is more important, on oxygen, pH and dissolved inorganic carbon (DIC); or, if you think the biology is more important, on chlorophyll, nitrates, phosphorus and silicates; or on some combination thereof. There is quite a lot of freedom here.

**Possible things to investigate and do with this data**:

* Provide a brief overview of the science behind the data and describe the data.
* What do the variables and parameters mean?
* Make sure the graphs you use are labelled, have a background grid, and use a font size that is readable etc.
* You have so many variables here that it is probably worth applying some dimension reduction technique to them.
* Predict one variable from one other or multiple others.
* The data is labelled by `YearDay`, but I've also provided a way to label the samples by `seasons` (arbitrarily defined, see below). Does clustering recover the seasonality, and if so how well?
* Does knowing the season help the regression, or not?
* Identify clusters, then use those for regression? Does that improve the skill compared with just throwing all the data in?
* Dimension reduction, then cluster, then do regression?
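
To make the last few bullets a bit more concrete, here is a minimal sketch of a "dimension reduction, then cluster, then regression" chain using scikit-learn. It runs on a synthetic stand-in rather than the real file: the three features, the target and the choice of four clusters are all made up for illustration, and PCA, k-means and linear regression are just one possible choice for each step, not a recommendation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real table (NOT the bacteria data):
# three made-up features following a seasonal cycle, plus a made-up target.
rng = np.random.default_rng(0)
n = 200
day = rng.integers(1, 366, size=n)  # day of year, 1..365
phase = 2 * np.pi * day / 365
features = np.column_stack([
    10 - 5 * np.cos(phase) + rng.normal(0, 1.0, n),
    30 + 2 * np.sin(phase) + rng.normal(0, 0.5, n),
    8 + 0.1 * np.cos(phase) + rng.normal(0, 0.05, n),
])
target = 2.0 * features[:, 0] - features[:, 1] + rng.normal(0, 1.0, n)

# Season labels from day of year (1 Mar = 60, 1 Jun = 152, 1 Sep = 244, 1 Dec = 335).
season = np.select(
    [(day >= 60) & (day < 152), (day >= 152) & (day < 244), (day >= 244) & (day < 335)],
    ["spring", "summer", "autumn"],
    default="winter",
)

# 1) Standardise (the variables have different units), then reduce to 2 components.
X = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
X_pc = pca.fit_transform(X)

# 2) Cluster in the reduced space and compare the clusters with the season labels.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_pc)
ari = adjusted_rand_score(season, clusters)  # 1 = identical grouping, about 0 = chance

# 3) Regress the target on the components, scoring on held-out samples.
X_train, X_test, y_train, y_test = train_test_split(
    X_pc, target, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)

print(f"explained variance ratio: {pca.explained_variance_ratio_}")
print(f"adjusted Rand index vs seasons: {ari:.2f}")
print(f"held-out R^2: {r2:.2f}")
```

For the real data you would replace `features`, `target` and `season` with columns of your choice from the loaded dataframe, and then ask whether each step actually helps compared with skipping it.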

---

In [None]:
# sample code to load the data

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

data_loc = "https://raw.githubusercontent.com/julianmak/OCES4303_ML_ocean/refs/heads/main/PIDweekly_env_data.txt"

df = pd.read_csv(data_loc, sep=r"\s+")  # raw string so "\s" is not treated as a string escape

In [None]:
# semi-artificially define seasons
# winter = DecJanFeb, spring = MarAprMay, summer = JunJulAug, autumn = SepOctNov

Mar = 31+28+1      # day of year of 1 Mar (non-leap year)
Jun = Mar+31+30+31 # no +1 because Mar already has the +1
Sep = Jun+30+31+31
Dec = Sep+30+31+30 # 1 Dec, i.e. the first day after autumn (SepOctNov)

# tagging the seasons in a slightly dumb way
seasons = []
for day in df["YearDay"]:
    doy = day % 365  # day of year, wrapped to one 365-day cycle
    if Mar <= doy < Jun:
        seasons.append("spring")
    elif Jun <= doy < Sep:
        seasons.append("summer")
    elif Sep <= doy < Dec:
        seasons.append("autumn")
    else:
        seasons.append("winter")
        
df["seasons"] = seasons
df = df.drop(labels=["SampleID", "Projected_Daily_Insolation", "MLLW"], axis=1)
df.sample(10)
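
The loop above is fine for a dataset of this size. As an aside, the same tagging can be written without an explicit loop; the sketch below uses `np.select` with the same day-of-year boundaries (1 Mar = 60, 1 Jun = 152, 1 Sep = 244, 1 Dec = 335). The function name `tag_seasons` is made up here for illustration and is not part of the course material.

```python
import numpy as np

def tag_seasons(year_day):
    """Return one season name per entry of year_day (hypothetical helper)."""
    d = np.asarray(year_day) % 365  # wrap to one 365-day cycle, as in the loop version
    conditions = [
        (d >= 60) & (d < 152),   # MarAprMay
        (d >= 152) & (d < 244),  # JunJulAug
        (d >= 244) & (d < 335),  # SepOctNov
    ]
    # anything not matched above (DecJanFeb) falls through to the default
    return np.select(conditions, ["spring", "summer", "autumn"], default="winter")

print(tag_seasons([1, 60, 152, 244, 335]))
```

With the dataframe loaded, something like `df["seasons"] = tag_seasons(df["YearDay"])` should then give the same labels as the loop.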

In [None]:
# sample code: scatter plots and labelling by seasons

fig = plt.figure(figsize=(5, 5))
ax = plt.axes()
for name in ["spring", "summer", "autumn", "winter"]:
    subset = df.loc[df["seasons"] == name]  # rows belonging to this season
    ax.scatter(subset["Bacteria_abundance"], subset["Temp"],
               label=name, zorder=2)  # larger zorder draws the points on top of the grid lines
ax.set_xlabel("bacteria abundance")
ax.set_ylabel("temperature")
ax.legend()
ax.grid(lw=0.5, zorder=0)               # thin grid lines, drawn behind the data