*01 Feb 2025, Julian Mak (whatever with copyright, do what you want with this)*

### As part of material for OCES 3301 "Data Analysis in Ocean Sciences" delivered at HKUST

For the latest version of the material, go to the public facing [GitHub](https://github.com/julianmak/academic-notes/tree/master/OCES3301_data_analysis_ocean) page.

### General spiel about assessment

***Your hand-in should be in the form of a Jupyter notebook and associated files (if any), and no other form of hand-in will be accepted***. The use of Jupyter notebook and its Python component is part of the assessment criteria for the *presentation* and *coding* portion. Hand these in through Canvas in the usual way. You are graded on the following attributes:

1) **scientific content** (40%)

2) **writing, presentation and referencing** (30%)

3) **use of Jupyter and/or Python coding** (30%)

4) **originality** (10%; analysis beyond the scope of the course, use of memes and puns; surprise me)

See the sample assignments I've made for the kind of things we might be expecting. We will probably be fairly loose with giving credit, but 60% or below would count as unsatisfactory (85% or above would be an A grade, I would imagine).

You are allowed to use other Python packages if you find them, but see point b) below.

a) ***Late assignments get a penalty of 1% of full marks per minute*** (so don't bother handing in anything after 2 hours). We will still mark it and give feedback, but you just don't get the credit. Excuses could be entertained but you will need sufficient evidence to back them up (e.g. your internet went down in the area and you have some pictorial/written demonstration of this).

b) ***Your code needs to be able to run from scratch at least in the standard Google Colab***, otherwise you will get no marks for the 3rd attribute, and probably next to nothing for the 1st attribute (because your graphs probably won't be generated). When you hand the notebooks in, you should pass them through `Kernel -> Restart & Clear Output`, so the file is reasonably sized and only full of text (and if you don't, *you get a 10% penalty* for not following instructions, for reasons in point c) below). The procedure here is that we will run the whole notebook from scratch, probably on [Google Colab](https://colab.research.google.com), then mark the resulting outputs. **So make sure you test your code through Google Colab at least!** (or do your assignments on there; find whatever work flow works for you).

c) ***Plagiarism***: By all means consult each other and/or work together, but the files you hand in should be done and written up separately. To allow for checks in Turnitin, you should pass the notebook through `Kernel -> Restart & Clear Output` before you hand it in. **The default for anyone accused of plagiarism is ZERO on the assignment**, and depending on whether you decide to contest and the result of the appeal, it could possibly lead to an official note of plagiarism on your transcript (I will allow people to argue but one should be ready for the consequences).

A few things count as plagiarism:

**Copying between students, and the default is that ALL parties involved get zero for the assignment**, regardless of whether one side can demonstrate they were the one copied from (extra incentive to keep the writing separate).

**Copying text without citation is plagiarism**. Use quotation marks and give a reference if you are directly lifting text, but don't do this too often (it will result in the text looking cluttered, and in not getting full credit for the *presentation* aspect).

**Code is a slightly more grey area**, but I will just say no one has ever really been punished for being cautious and generous with citations, but make sure you present it well (e.g. overburdening text with citations will make the presentation ugly, and will not get full credit for the *presentation* aspect say).

I will just make the point that we don't tend to make accusations of plagiarism unless we have enough proof, and if we are doing it, it probably means we think we have a sufficiently strong case that is probably not worth arguing against (because the penalty then gets increased).

---------------------------
# Assessment 4 (25% of total course grade)

(So none of the individual components of this assignment are very hard by themselves, but there are quite a lot of steps involved, so see the suggested work flow below.)

Here we are going to reproduce some of the El-Nino diagnostics using the ***full*** ERSST monthly SST data (**don't use the anomaly file**) in the equatorial Pacific for the period from 1910 to 2010 (the choice of upper limit is rather arbitrary, but the reason for 1910 will be hinted at later).

There are two things I would minimally want to see, and doing these well will be enough to give you an "A" grade:

1) Regenerate something like the El-Nino 3.4 time-series (cf. `elnino34_sst.data`) and analyse it [to test on using xarray and numpy to process spatially varying data, repeating time-series analysis, cross-validating previous analysis]

2) Compute the EOFs associated with El-Nino, analyse the resulting PCs [as above, but also to test on EOF analysis]

To help you along, you might consider doing the below *in order*. For concreteness, just do the following for the data between 1910 and 2010.

### Suggested (minimal) things you should do:

* xarray, cartopy, PSD, detrending, numpy/xarray functions dealing with NaNs

* use xarray to read the full ERSST data, slice out the data associated with the equatorial Pacific, and use Cartopy to do plotting where appropriate (see some of the code below to get you started)

* for the El-Nino 3.4 task, you want the average SST over the El-Nino 3.4 region
  * you should look up what the El-Nino 3.4 region corresponds to in terms of longitude and latitude, and just take a simple average in this case to get the appropriate time-series
  * you should detrend over the period, but you probably don't need to do a rolling average of the time-series data
  * for a Fourier analysis to pick out periods, you need the time elapsed to rescale your frequencies accordingly; see `08_time_series` and code below for some of that
  * the `time` data is given in the files themselves
  * compare it with the results you got for analysing `elnino34_sst.data`
  
* the EOF one is similar but there are more pre-processing steps involved
  * do `sst.sel(lon=slice(120, 300), lat=slice(-30, 30))` first, which will select data between the specified lon/lat locations, but *do not subset time* for now
  * you should identify all the land points to exclude (since these are NaNs), and do a spatial average that excludes NaNs to get the time series of domain-averaged SST (cf. above for El-Nino 3.4 but for a bigger domain)
  * now, you probably want to do a rolling average of the resulting domain-averaged SST; try something like `sst_rolling = sst_subset.rolling(time=12, center=True).mean(skipna=True).dropna("time", how="all")`, which takes a window of 12 entries (so a window of 1 year), centers it, averages over the window, and drops the time entries where every spatial point is NaN, which occur at the edges of the time-series
  * *then* you subset that averaged time-series to get the entries you care about as something like `sst_rolling = sst_rolling.sel(time=slice("1910", "2010"))` (note the years are given as strings, as in the given code below), which throws out the NaNs at the edges of your data
  * compute the linear trend of the time-series for detrending purposes
  * now select the full SST data, do a rolling average of it as above, and then detrend it using the computed linear trend for every point
    * a `for` loop would be much safer
    * you could try and rely on *broadcasting* via `(sst_rolling - lin_trend[:, np.newaxis, np.newaxis])`, to get element-wise subtraction of `lin_trend[t]` from `sst_rolling[t, lat, lon]` given the dimension mismatch
  * proceed to do an EOF analysis on it as usual, being careful about only including the wet points (mostly copy and pasting from `10_fun_with_maps`)
  * the PCs will not be in sensible units because the data is not normalised, but since you are only going to be picking periods from the power spectrum it doesn't really matter
    * you may or may not want to window or smooth out the resulting PC a little bit
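
As a rough illustration of the El-Nino 3.4 steps in the list above (box average that skips NaNs, linear detrend, picking out a period), here is a minimal sketch on *synthetic* data using only numpy. Everything in it is an assumption made for illustration: the grid, the planted 4-year cycle, and in particular the averaging box, which is deliberately *not* the real El-Nino 3.4 region (you still need to look that up). With xarray the slicing and averaging would go through `.sel` and `.mean` instead, as in the given code further down.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic monthly "SST" on a coarse grid (all values here are made up)
n_years = 100
nt = 12 * n_years
t_years = np.arange(nt) / 12.0             # time elapsed in years
lat = np.arange(-30.0, 31.0, 2.0)          # degrees north
lon = np.arange(120.0, 301.0, 2.0)         # degrees east, 0-360 convention

period_true = 4.0                          # planted oscillation period in years
signal_t = 0.005 * t_years + np.sin(2 * np.pi * t_years / period_true)
noise = 0.3 * rng.standard_normal((nt, lat.size, lon.size))
sst = 25.0 + signal_t[:, None, None] + noise
sst[:, :3, :5] = np.nan                    # pretend a corner of the domain is land

# hypothetical averaging box, NOT the real El-Nino 3.4 bounds
box_lat = (lat >= -4.0) & (lat <= 4.0)
box_lon = (lon >= 200.0) & (lon <= 240.0)
box = sst[:, box_lat, :][:, :, box_lon]

# simple (unweighted) box average that skips NaNs
series = np.nanmean(box, axis=(1, 2))

# remove the linear trend fitted over the whole period
coeffs = np.polyfit(t_years, series, deg=1)
detrended = series - np.polyval(coeffs, t_years)

# raw periodogram; sample spacing is 1/12 year, so freq is in cycles per year
freq = np.fft.rfftfreq(nt, d=1.0 / 12.0)
power = np.abs(np.fft.rfft(detrended)) ** 2

# skip the zero frequency and convert the peak frequency to a period in years
peak = np.argmax(power[1:]) + 1
period_est = 1.0 / freq[peak]
print(f"dominant period is roughly {period_est:.2f} years")
```

The frequencies come out in cycles per year only because the sample spacing handed to `np.fft.rfftfreq` is expressed in years. A raw periodogram like this is likely to be much noisier on the real data, so treat the peak picking here as a starting point only.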
  
To help you along a bit, the EOFs and PCs I got after doing those steps are given below:

<img src="https://i.imgur.com/8pZB73d.png" width="800" alt='elnino'>

The 1st EOF explains almost 50% of the variance and is the standard El-Nino signal (note that massive tongue in the equatorial Pacific), while the 2nd EOF is what is called El-Nino Modoki (note the V-shape pattern looks like `<` and is blue in this case), and explains maybe about 10% of the variance. You should cross check the patterns I got, provide references, and look up a bit on the background, differences and similarities between standard El-Nino and El-Nino Modoki, and say a bit on why we might care about those climate modes of variability.
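
The pre-processing and EOF steps in the suggested list can also be sketched on synthetic data. This is a sketch under stated assumptions, not the code from `10_fun_with_maps`: it swaps in a plain SVD from numpy rather than `sklearn.decomposition.PCA`, and the field, the planted pattern and the "land" points are all made up. It shows the broadcasting subtraction of a domain-mean linear trend, the selection of wet points, and putting a pattern back on the map.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic field with dimensions (time, lat, lon); all values are made up
nt, ny, nx = 240, 10, 20
t = np.arange(nt) / 12.0                                    # time in years
yy, xx = np.meshgrid(np.linspace(-1, 1, ny), np.linspace(-1, 1, nx), indexing="ij")
pattern = np.exp(-(yy**2) / 0.1) * (xx > 0)                 # planted "tongue" pattern
pc_true = np.sin(2 * np.pi * t / 4.0)                       # planted 4-year oscillation
field = (0.02 * t[:, None, None]                            # uniform warming trend
         + pc_true[:, None, None] * pattern[None, :, :]
         + 0.1 * rng.standard_normal((nt, ny, nx)))
field[:, :2, :3] = np.nan                                   # pretend these are land points

# linear trend of the domain average, removed at every point via broadcasting
domain_mean = np.nanmean(field, axis=(1, 2))
lin_trend = np.polyval(np.polyfit(t, domain_mean, deg=1), t)
anom = field - lin_trend[:, np.newaxis, np.newaxis]

# keep only the wet points, i.e. grid points that are never NaN
wet = ~np.isnan(anom).any(axis=0)                           # shape (ny, nx)
X = anom[:, wet]                                            # shape (nt, n_wet)
X = X - X.mean(axis=0)                                      # remove the time mean per point

# EOFs via SVD: rows of Vt are spatial patterns, columns of U * s are the PCs
U, s, Vt = np.linalg.svd(X, full_matrices=False)
var_frac = s**2 / np.sum(s**2)
pcs = U * s                                                 # shape (nt, n_modes)

# put the leading EOF back on the map, with NaN over land
eof1 = np.full((ny, nx), np.nan)
eof1[wet] = Vt[0]
print(f"mode 1 explains about {100 * var_frac[0]:.0f}% of the variance")
```

The sign of each EOF/PC pair is arbitrary (flipping both gives the same reconstruction), which is worth remembering when comparing colours against somebody else's figure.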

### Assessment key points

Things you should do here (and the intention behind the tasks):

a) some background reading/research into El-Nino [*be able to do some background research*]

b) demonstrate ability in processing and plotting geophysical data using Python tools, via xarray, cartopy, scipy and numpy as appropriate [*data processing, data analysis, Python competence*]

c) understanding of data analysis tools so far and how to put them together [*Python competence, seeing how tools fit together*]

d) interpretation of results [*referencing, using data analysis to back up statements*]

e) write some of these things up and describe them using the Markdown cells [*practise and demonstrate understanding of Jupyter notebooks*]

f) any others that could fall under originality (additional considerations below, memes welcome, references to Miffy even better; scientific content should always come first)
  
### Additional things you might consider doing

* what if you don't detrend and/or do the rolling averages?
  * you will probably get what's called a dipole pattern centered over the Equator, what would that correspond to?
  * you might also get one that is warm or cold everywhere, what would that correspond to?
  * are the associated periods what you expected?
  * comment on the changes to the variance explained

* what if you take a larger domain (e.g. extending more in the Pacific, over the whole globe)?
  * try [this one](https://www.mbari.org/science/upper-ocean-systems/biological-oceanography/global-modes-of-sea-surface-temperature/) or [Wills et al. 2018](https://atmos.washington.edu/~rcwills/papers/2018_Wills_LFCA.pdf)
  * comment on the changes to the variance explained

* what if you include the years before 1910?
  * you should try and plot out the time-series first, which will give you an inkling of what is going on and a warning on why you want to be careful with detrending

* animate the EOFs via multiplying it by the PCs

***You should name your notebook "ass4_elnino_STUDENTID.ipynb" when you hand the notebook in through Canvas***. When you hand in the notebook, make sure to delete all the cells above and including this one. Failure to do so may result in anything up to a ***5% deduction***, and this is ***on top of whatever deductions we may have made above for code not working*** under the **use of Jupyter and/or Python coding** category.

For this one, I would like you to hand in a final version of the notebook that loads the data via **remote** means. The data here is not that small, so you might want to download it and then load the file locally for speed reasons (I might change it to the local load option when I mark it for speed reasons, so I suppose you can assume I will have access to a version of the data I provide a link to below).

---

In [None]:
# basic packages you might want to use

import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
import fsspec
from scipy import signal
import scipy
from sklearn.decomposition import PCA
import datetime

# some modules for cartopy (I used 0.18.0 apparently, from "cartopy.__version__")
try:
  import cartopy
except ModuleNotFoundError:
  !pip install cartopy
  import cartopy

import cartopy.crs as ccrs
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER
from mpl_toolkits.axes_grid1 import make_axes_locatable

### Given code

In [None]:
# for downloading the data if you need it
# !wget https://github.com/julianmak/OCES3301_data_analysis/raw/main/ersstv5_sst.nc

option = "remote"

if option == "local":
    print("loading data locally (assumes file has already been downloaded)")
    file = "ersstv5_sst.nc"
elif option == "remote":
    # do a local caching (downloads a file to cache)
    print("loading data remotely")
    file_loc = "https://github.com/julianmak/OCES3301_data_analysis/raw/refs/heads/main/ersstv5_sst.nc"
    file = fsspec.open_local(f"simplecache::{file_loc}", filecache={'cache_storage': '/tmp/fsspec_cache'})
else:
    raise ValueError("INVALID OPTION: use 'remote' or 'local'")

df = xr.open_dataset(file)
df

In [None]:
# demonstrating some cartopy and xarray commands

pcarree = ccrs.PlateCarree()

target_lon, target_lat = slice(120, 300), slice(-30, 30)


# trivial selection to drop the z co-ordinate (there is only 1)
sst = df["sst"].isel(lev=0).sel(lon=target_lon, lat=target_lat)

# pull out other useful things to carry around
lon  = df["lon"].sel(lon=target_lon).values
lat  = df["lat"].sel(lat=target_lat).values

fig = plt.figure(figsize=(10, 4))
ax = plt.axes(projection=ccrs.PlateCarree(central_longitude=180.0))
sst.isel(time=0).plot(ax=ax, transform=pcarree, cmap="RdBu_r")
gl = ax.gridlines(crs=pcarree, draw_labels=True,
                  linewidth=2, color='gray', alpha=0.5, linestyle='--')
gl.top_labels = False
gl.right_labels = False
ax.add_feature(cartopy.feature.LAND, zorder = 10, edgecolor = 'k')

In [None]:
# plot domain-averaged sst (over the lon/lat subset selected above) and demonstration of creating artificial time array and manipulating python datetime64
sst_mean = sst.mean(dim=["lat", "lon"], skipna=True)
sst_rolling = sst_mean.rolling(time=12, center=True).mean(skipna=True).dropna("time", how="all")

# rolling introduces NaNs on the outer edges, so going to throw those away by subsetting touched up data in time
target_t = slice("1910", "2010")  # select analysis period
sst_mean = sst_mean.sel(time=target_t)
sst_rolling = sst_rolling.sel(time=target_t)

# get linear trend
time = sst_mean["time"]  # when turning into `np.float32` this is in units of nanoseconds for some reason...
# the time interval seems to be in days / months
# hence the converted time unit needs to be in days / month to prevent data loss
# choose 'Day' here and np.float64
time_in_day = time.values.astype('M8[D]').astype(np.float64)
# reference date = the first time in the data; subtract it so that elapsed time starts at zero
ref_date = time_in_day[0]
time_in_day = time_in_day - ref_date
start_day, end_day = time_in_day[0], time_in_day[-1]
# linear regression of the domain-averaged sst against elapsed time in days
p = np.polyfit(time_in_day, sst_mean, deg=1)
lin_trend = np.polyval(p, time_in_day)

fig = plt.figure(figsize=(10, 4))
ax = plt.axes()
ax.plot(time, sst_mean, alpha=0.7)
ax.plot(time, sst_rolling, 'k--')
ax.plot(time, lin_trend, "C3--")
ax.set_ylabel('Temperature')
ax.set_xlabel('Year')
ax.grid()