*updated 23 Jul 2025, Julian Mak (whatever with copyright, do what you want with this)

### As part of material for OCES 4303 "AI and Machine Learning in Marine Science" delivered at HKUST

For the latest version of the material, go to the public facing [GitHub](https://github.com/julianmak/OCES4303_ML_ocean) page.


---
# 01: Python recap and data handling

The course here will use [Python](https://en.wikipedia.org/wiki/Python_(programming_language)) through [Jupyter notebooks](https://jupyter.org/). I chose Python because:
* It's free (i.e. not [MATLAB](https://en.wikipedia.org/wiki/MATLAB) or SPSS)
* It's not [R](https://en.wikipedia.org/wiki/R_(programming_language)) (I hate R syntax personally, but it is a very powerful tool)
* Python is widely used, has a lot of packages built in, and is pretty mature in terms of userbase and support (an appropriate Google search will help with debugging most of the time)
* Familiarity to me

I will openly admit I do not write Python in a Pythonic way: I started on MATLAB until MATLAB screwed up vector graphics outputs for me, so I rage quit and went to Python. The code provided here is certainly not the cleanest way to do it (this is sometimes by design), nor is it the most idiomatic way of doing it, but it should (mostly?) work and do the intended thing.

## <span style="color:red">!!! NOTE !!!</span> 

The content here has OCES 3301 as a pre-requisite, and largely assumes familiarity with Python and some of the associated packages (see the list of packages to be loaded). The course content is available at https://github.com/julianmak/OCES3301_data_analysis for reference purposes.

---
# 1. Recapping Python through various data handling

What it says on the title. Just going to load and do basic manipulations of data previously considered in OCES 3301, as well as some new ones that will be used for demonstration purposes. 

> ## Key Objective(s)
> 1. The present notebook is to check you can in fact load (almost) all the data that will be used for the course. If you are having trouble now then it really should be fixed (e.g. ask for help), because there will be issues for the remaining content...
> 2. Demonstrates some Python loading/plotting approaches.
> 3. Demonstrates some manipulations of array data, as well as `pandas` and `xarray` dataframes.

Going to load a bunch of relevant libraries first.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import xarray as xr

---
## a) Numerical data: python generated

Name of the game for most of this course is to turn the data we read into numbers, manipulate those into a form that the Python data science + Machine Learning packages understand, and then feed them in. So it's probably useful to start with those.

Below is a simple example of the function
\begin{equation*}
    f = \sin(t)
\end{equation*}
modified in various ways (this is the same example used in `07_time_series` of OCES 3301). I am going to generate an ***array*** of numbers for $t$, and this is then fed into a function that spits out another array.

> Note: For displaying to screen I will mostly use ***fstrings*** (e.g. `f"STUFF"` with the preface `f` before the string marks `" "`), although occasionally I will use `r` instead if I need some specific formatting.
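
As a small illustration of the note above, here is a minimal sketch of the two string prefixes. The variable names (`value`, `label`, `msg`, `tex`) are made up for the example; the `r` prefix marks a raw string, which keeps backslashes as they are and is handy for LaTeX labels in plots.

```python
import numpy as np

value = np.pi
label = "pi"

# f-string: whatever is inside {} gets evaluated, and an optional
# format spec after ":" controls how it is displayed
msg = f"{label} = {value:.3f}"
print(msg)

# raw string: backslashes are kept literally (no escape sequences),
# which is what matplotlib needs for LaTeX labels
tex = r"$f_{\rm shift}$"
print(tex)
```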

In [None]:
t_vec   = np.linspace(0, 2.0 * np.pi, 31)
f       =   np.sin(t_vec)
f_pos   = 2*np.sin(t_vec)
f_neg   =  -np.sin(t_vec)
f_shift =   np.sin(t_vec - np.pi / 2.0)

print(f"t_vec = {t_vec}")
print(" ")
print(f"f = {f}")
print(" ")

The above as shown is not hugely useful since it's just a dump of numbers. We can visualise this accordingly. Left plot shows it as a function of time (e.g. $[f_0, f_1, \ldots]$ against $[t_0, t_1, \ldots]$), right plots show it as one function plotted against another (e.g. $[f_0, f_1, \ldots]$ against $[g_0, g_1, \ldots]$).

In [None]:
# award winning graph

# 2x4 grid, first graph takes up a 2x2 space located first at the upper left corner (0, 0)
fig = plt.figure(figsize=(10, 3))
ax = plt.subplot2grid((2, 4), (0, 0), colspan=2, rowspan=2)
ax.plot(t_vec, f    ,   "C0", label=r"$f$")
ax.plot(t_vec, f_pos,   "C1", label=r"$f^+$") 
ax.plot(t_vec, f_neg,   "C2", label=r"$f^-$")
ax.plot(t_vec, f_shift, "C3", label=r"$f_{\rm shift}$")
ax.set_xlabel(r"$t$")
ax.set_ylabel(r"$f$")
ax.grid()
ax.legend()

# subsequent graphs are 1x1 but with a change in the location
ax = plt.subplot2grid((2, 4), (0, 2))
ax.scatter(f, f, color="C0")
ax.set_xlabel(r"$f$")
ax.set_ylabel(r"$f$")
ax.grid()

ax = plt.subplot2grid((2, 4), (0, 3))
ax.scatter(f, f_pos, color="C1")
ax.set_xlabel(r"$f$")
ax.set_ylabel(r"$f^+$")
ax.grid()

ax = plt.subplot2grid((2, 4), (1, 2))
ax.scatter(f, f_neg, color="C2")
ax.set_xlabel(r"$f$")
ax.set_ylabel(r"$f^-$")
ax.grid()

ax = plt.subplot2grid((2, 4), (1, 3))
ax.scatter(f, f_shift, color="C3")
ax.set_xlabel(r"$f$")
ax.set_ylabel(r"$f_{\rm shift}$")
ax.grid()

fig.tight_layout(pad=1.0) # give the graph a bit of padding

> <span style="color:red">**Q.**</span> This was previously used to demonstrate lag (linear) correlations. Convince yourself that the correlations of the right hand side subplots are (going clockwise) 1, 1, 0 and -1 as they should be, corresponding accordingly to what you would suspect from looking at the entries in the left hand side subplot. Convince yourself that the **lagged correlation** looks like a sine (or cosine) curve.

Can do stuff for multi-dimension data, but going to do this together with reading data from files.
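
Going back to the correlation question above, one possible numerical check (a sketch, not the only way) is to use `np.corrcoef`. The helper name `pearson` and the dictionary `corrs` are made up here; the arrays are regenerated so the block runs on its own.

```python
import numpy as np

t_vec   = np.linspace(0, 2.0 * np.pi, 31)
f       = np.sin(t_vec)
f_pos   = 2 * np.sin(t_vec)
f_neg   = -np.sin(t_vec)
f_shift = np.sin(t_vec - np.pi / 2.0)

def pearson(a, b):
    """Pearson correlation coefficient between two 1d arrays."""
    # np.corrcoef returns the 2x2 correlation matrix, we want the off-diagonal
    return np.corrcoef(a, b)[0, 1]

corrs = {
    "f": pearson(f, f),
    "f_pos": pearson(f, f_pos),
    "f_neg": pearson(f, f_neg),
    "f_shift": pearson(f, f_shift),
}

for name, val in corrs.items():
    print(f"corr(f, {name}) = {val:+.3f}")
```

Rescaling by a positive constant does not change the correlation, flipping the sign flips it, and the quarter period shift leaves the two arrays uncorrelated (up to floating point error).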

---
## b) Reading numerical data from file

<img src="https://i.imgur.com/rKcpZzr.jpg" width="400" alt='cursed panda'>

I am going to rely on `pandas` or `xarray` to read the data provided from file (or remotely via an internet connection if the files are small enough), and then do some manipulations and/or plotting with these; these can be done in principle via other means (see OCES 3301 for example).

## El Nino 3.4 data

This is the text file shown in the lecture slides. Going to call this remotely and spit out the contents.

In [None]:
option = "remote"

if option == "local":
    print("loading data locally (assumes file has already been downloaded)")
    path = "elnino34_sst.data"
elif option == "remote":
    print("loading data remotely")
    path = "https://raw.githubusercontent.com/julianmak/OCES4303_ML_ocean/refs/heads/main/elnino34_sst.data"
else:
    raise ValueError("INVALID OPTION: use 'remote' or 'local'")

data = pd.read_csv(path)
data

Generally pandas ***tries*** to read things assuming sensible layout etc., but that can fail if the data is not cleaned up (and most data is uncleaned for your exact purpose). Here we have headers and misc. things we don't really need, leading to things being read into a single block. 

It is generally a good idea to have a look at the raw data file first to see what it consists of and anticipate what things you might need to do. In this case, optional arguments need to be provided (e.g. delimiter, separator, etc...)
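
To make the "look first, then tell pandas about the layout" idea concrete, here is a minimal sketch on a made-up snippet of text (this is NOT the real El Nino file; the numbers, the header line and the footer line are all invented for illustration). It uses `io.StringIO` so nothing needs to be downloaded.

```python
import io
import numpy as np
import pandas as pd

# made-up stand-in for a whitespace-separated text file with one header line,
# a missing value flag of -99.99, and one footer line
lines = [
    "1950 2021",
    "1950  24.5  25.0  -99.99",
    "1951  25.2  -99.99  26.1",
    "-99.99 missing",
]
raw = "\n".join(lines)

# step 1: look at the raw text before parsing
for line in raw.splitlines():
    print(repr(line))

# step 2: tell pandas about the layout
df_demo = pd.read_csv(io.StringIO(raw),
                      sep=r"\s+",                          # any amount of whitespace
                      names=["year", "Jan", "Feb", "Mar"],  # column names (assumed)
                      skiprows=1,                          # drop the header line
                      skipfooter=1,                        # drop the footer line
                      na_values=[-99.99],                  # missing value flag -> NaN
                      engine="python")                     # needed for skipfooter
df_demo = df_demo.set_index("year")
print(df_demo)
```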

In [None]:
# can give it a few more details to make it easier for pandas to help us
df = pd.read_csv(path,
          sep=r'\s+',    # raw string for the regex; this used to be delim_whitespace=True,
          names=["year", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
          skipfooter=4,  # chop out some lines
          skiprows=1,    # chop out some unnecessary lines
          na_values=[-99.99],  # flag used for missing values (false_values is for booleans)
          engine="python")
df = df.replace(-99.99, np.nan) # replace missing values with NaNs (not a number)
df = df.set_index("year")       # sets the index to be the year column
df

Notice there are `NaN`s in the data, which itself is not an issue. The pandas dataframe `df` has `72 rows x 12 columns`, which is 72 years of monthly data. We can do a quick and dirty plot of the data.

In [None]:
# not an award winning graph

fig = plt.figure(figsize=(8, 3))
ax = plt.axes()
df.plot(ylabel=r"${}^\circ\mathrm{C}$", ax=ax)  # pass some keywords in
ax.grid()

In this case it's treating the months (the column headers) as categories, i.e. plotting all the January temperatures as a function of `year`. This isn't necessarily what we want: we probably want it as a time-series with increasing time.

There are ways to do the reshaping in pandas, but for demonstration I am going to do this in native `numpy`. I am going to:

1. load the pandas `df` frame into a numpy array with `df.values`.
2. I want to keep the column ordering but remove the rows (e.g. have 1st row of 12, then 2nd row of 12 etc.), to be done via `.reshape(SIZE)`
3. above can be done with `.flatten()` also
4. I am additionally going to compute the linear ***line of best fit*** (cf. `07_time_series` in OCES 3301), which I will need if I want to detrend the data to get the ***anomalies*** (although I don't use it here)

> <span style="color:red">**Q.**</span> `.reshape()` and `.flatten()` both use `order="C"` by default, which gives the right thing in this case. Try this with `order="F"` and convince yourself that is the wrong thing to do (the positions of `NaN`s would help).
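
A toy version of the ordering issue, as a sketch: the array `table` below is a made-up "2 years by 3 months" table where the last two months of the last year are missing. Flattening row by row keeps the time ordering, while flattening column by column scrambles it.

```python
import numpy as np

# toy "2 years x 3 months" table, last two months of the final year missing
table = np.array([[1.0, 2.0, 3.0],
                  [4.0, np.nan, np.nan]])

flat_c = table.reshape(table.size, order="C")  # row by row (year after year)
flat_f = table.reshape(table.size, order="F")  # column by column (month after month)

print(f"order C: {flat_c}")
print(f"order F: {flat_f}")
```

With `order="C"` the `NaN`s sit together at the end, as they should for a time series; with `order="F"` they end up interleaved with valid data.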

In [None]:
data = df.values
print(f"data has shape {data.shape}")
data = data.reshape(data.size)  # data.size gives the total number of entries
print(f"data now has shape {data.shape} after reshape or flatten")
print(" ")

# line of best fit via polyfit + polyval (only because I don't want to load scipy)
# need to remove NaNs first
data_dum = data[~np.isnan(data)]  # find the NOT NaNs
p = np.polyfit(np.arange(len(data_dum)), data_dum, 1)

fig = plt.figure(figsize=(8, 3))
ax = plt.axes()
ax.plot(data, label="data")
ax.plot(np.polyval(p, np.arange(len(data_dum))), label="LOBF")
ax.set_xlabel("index")
ax.set_ylabel(r"${}^\circ\mathrm{C}$")
ax.grid()
ax.legend();

> <span style="color:red">**Q.**</span> I was lazy and didn't provide the time array. Create a time array and do a proper regression (be careful of units). Then you can get a magnitude of a global warming trend from `p[0]` (which is the gradient of the straight line).
>
> <span style="color:red">**Q.**</span> Do the above but with `scipy` or `sklearn`.

See extended exercises for more things to do.

## Penguin data

<img src="https://www.boredpanda.com/blog/wp-content/uploads/2020/08/cats-standing-like-penguins-fb-png__700.jpg" width="500" alt='cursed penguins'>

The [Palmer Penguins](https://cran.r-project.org/web/packages/palmerpenguins/readme/README.html) data was compiled as a replacement/alternative to the standard [iris data](https://en.wikipedia.org/wiki/Iris_flower_data_set) because of racism/eugenics reasons of Ronald Fisher (look it up if you are interested). A mildly touched up version is given here as `penguins.csv` (or https://raw.githubusercontent.com/julianmak/OCES4303_ML_ocean/refs/heads/main/penguins.csv; I removed some columns and some `NaN`s). We are going to be using this dataset quite a bit.

In [None]:
option = "remote"

if option == "local":
    print("loading data locally (assumes file has already been downloaded)")
    path = "penguins.csv"
elif option == "remote":
    print("loading data remotely")
    path = "https://raw.githubusercontent.com/julianmak/OCES4303_ML_ocean/refs/heads/main/penguins.csv"
else:
    raise ValueError("INVALID OPTION: use 'remote' or 'local'")

df = pd.read_csv(path)
df

So this one is a text file but notice the headers are also loaded and are in fact useful because they tell you the data entries and also the units (which is more than can be said for a lot of data...). Notice also the species column is text while others are numbers; we will end up converting the species entries to numerical values in due course.
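
As a preview of turning text labels into numbers, here is one possible sketch using `pd.factorize` on a tiny made-up frame (the values in `toy` are illustrative only, and later notebooks may well use a different encoder).

```python
import pandas as pd

# tiny made-up frame with a text column and a numerical column
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "body_mass_g": [3700.0, 5000.0, 3800.0, 3600.0],
})

# factorize assigns integer codes in order of first appearance,
# and also returns the unique labels so the mapping can be undone
codes, uniques = pd.factorize(toy["species"])
toy["species_code"] = codes

print(toy)
print(list(uniques))
```

The integer codes are just labels: the ordering carries no meaning, which matters for some algorithms later on.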

Zeroth step of data analysis/exploration is to actually plot out the data first. Here are some random things I thought could be done to demonstrate plotting/visualising and calling of things from `pandas`.

In [None]:
# histograms of one randomly chosen variable, cycling through the species

target_vars = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
target_var = np.random.choice(target_vars)

fig = plt.figure(figsize=(6, 3))
ax = plt.axes()

for species in df["species"].unique():   # pick out all unique entries under `species`
    ax.hist(df[df["species"] == species][target_var], label=species, alpha=0.8)

ax.set_xlabel(f"{target_var}")
ax.set_ylabel("frequency")
ax.grid()
ax.legend();

In [None]:
# do a 3d plot of three random choices of variables

from mpl_toolkits import mplot3d  # load a package for 3d plots

target_vars = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
target_var = np.random.choice(target_vars, 3, replace=False)  # 3 unique choices

fig = plt.figure(figsize=(6, 6))
ax = plt.axes(projection="3d")

for species in df["species"].unique():   # pick out all unique entries under `species`
    ax.scatter(df[df["species"] == species][target_var[0]], 
               df[df["species"] == species][target_var[1]],
               df[df["species"] == species][target_var[2]],
               label=species
               )
ax.set_xlabel(f"{target_var[0]}")
ax.set_ylabel(f"{target_var[1]}")
ax.set_zlabel(f"{target_var[2]}")
ax.grid(lw=0.5, zorder=0)
ax.legend()
ax.view_init(25, -45)

> <span style="color:red">**Q.**</span> In the histogram plot I was deliberately lazy and didn't provide pre-defined bin edges (so `ax.hist` ended up choosing it). Pre-define the bin edges and do the binning of the data so that data from every species uses the same bins. You may want to use `np.histogram` instead, and then throw the outputs from there into `ax.hist` accordingly.
>
> <span style="color:red">**Q.**</span> The histogram plot shows frequency for now, but turn that into a probability.
>
> <span style="color:red">**Q.**</span> Explore other combinations of scatter plots in both 2d and 3d.
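
For the binning questions above, a possible starting point is sketched below with synthetic numbers (`group_a` and `group_b` are made up and are not the penguin data; the choice of 20 bins is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-ins for one variable split into two groups
group_a = rng.normal(190.0, 6.0, 150)
group_b = rng.normal(215.0, 7.0, 120)

# common bin edges spanning both groups
all_vals = np.concatenate([group_a, group_b])
bin_edges = np.linspace(all_vals.min(), all_vals.max(), 21)  # 20 bins

# counts per bin, using the same edges for both groups
counts_a, _ = np.histogram(group_a, bins=bin_edges)
counts_b, _ = np.histogram(group_b, bins=bin_edges)

# frequency -> probability: divide by the number of samples in each group
prob_a = counts_a / counts_a.sum()
prob_b = counts_b / counts_b.sum()

print(f"total probability in a: {prob_a.sum():.3f}, in b: {prob_b.sum():.3f}")
```

The same `bin_edges` can be passed as `bins=bin_edges` to `ax.hist`. Note that `density=True` gives a probability *density* (integrates to one) rather than a probability per bin (sums to one).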

## Gridded data

By gridded I mean data that have pre-defined co-ordinates that sit on a grid, such as (longitude, latitude) or similar. An example of this is satellite data: initially the data is per swath, but given enough swaths some filling in can be done and the data put on a regular grid that is more useful for end users. One ongoing application of ML in oceanography is that, instead of waiting for enough swaths, maybe you could use ML to fill in the gaps.

The one I am showing here is from a simulation (sample from [NEMO ORCA0083-N01](https://gws-access.jasmin.ac.uk/public/nemo/)). The original dataset is REALLY big, so I downsized it quite significantly. The data is in the [NetCDF](https://en.wikipedia.org/wiki/NetCDF) format which is supposed to be self-describing (the one I made is not quite that). Going to open this with `xarray`.

In [None]:
# would do "local" for this one, because the filesize is not small (~45 MB)

# could do this once and for all (as long as you save the file) with
# !wget https://github.com/julianmak/OCES4303_ML_ocean/raw/refs/heads/main/current_speed.nc

import fsspec  # for caching the file if using remote option

option = "remote"

if option == "local":
    print("loading data locally (assumes file has already been downloaded)")
    file = "current_speed.nc"
elif option == "remote":
    # do a local caching (downloads a file to cache)
    print("loading data remotely")
    file_loc = "https://github.com/julianmak/OCES4303_ML_ocean/raw/refs/heads/main/current_speed.nc"
    file = fsspec.open_local(f"simplecache::{file_loc}", filecache={'cache_storage': '/tmp/fsspec_cache'})
else:
    raise ValueError("INVALID OPTION: use 'remote' or 'local'")

df = xr.open_dataset(file)
df

This one is `(time, lat, lon)`, and although I didn't write the units in, it is `m s-1`. Here we can select one time and plot out the data as a map.
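
The indexing logic for `(time, lat, lon)` data can be sketched in plain `numpy` as below. Everything here is synthetic: the grid, the random `speed` values and the target location `(lat0, lon0)` are made up for illustration.

```python
import numpy as np

# synthetic stand-in with the same (time, lat, lon) layout
time = np.arange(5)
lat  = np.linspace(-10.0, 10.0, 21)   # 1 degree spacing
lon  = np.linspace(100.0, 140.0, 41)  # 1 degree spacing
rng = np.random.default_rng(0)
speed = rng.random((time.size, lat.size, lon.size))

# a map at one time: fix the time index, keep all lat and lon
t_ind = 0
speed_map = speed[t_ind, :, :]

# a time series at the grid point nearest to a target location
lat0, lon0 = 2.3, 120.6
j = np.argmin(np.abs(lat - lat0))
i = np.argmin(np.abs(lon - lon0))
speed_series = speed[:, j, i]

print(f"map has shape {speed_map.shape}")
print(f"time series has shape {speed_series.shape} at (lat, lon) = ({lat[j]}, {lon[i]})")
```

With `xarray` the same selections can be done by name rather than by position, e.g. `.isel(time=t_ind)` for the map and `.sel(lat=lat0, lon=lon0, method="nearest")` for the nearest grid point, assuming the dimensions are named as above.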

In [None]:
# using contourf here, could do pcolor also
t_ind = 0

fig = plt.figure(figsize=(5, 4))
ax = plt.axes()
cs = ax.contourf(df["lon"], df["lat"], df["speed"][t_ind, :, :], 31)
ax.set_xlabel(r"lon $(^\circ)$")
ax.set_ylabel(r"lat $(^\circ)$")
ax.set_title(df["time"][t_ind].values)  # take .values to remove other xarray descriptors
cax = plt.colorbar(cs)
cax.ax.set_title(r"$\mathrm{m}\ \mathrm{s}^{-1}$");

> <span style="color:red">**Q.**</span> Plot out zonal or meridional slices instead.
>
> <span style="color:red">**Q.**</span> Select one location and plot out the time series.
>
> <span style="color:red">**Q.**</span> Make a movie out of the data.

## Argo data

[Argo](https://argo.ucsd.edu/data/) is a system of autonomous floats that are put into the ocean, float around with the currents, and periodically do vertical sections to take in-situ measurements of things like temperature, salinity, pressure, and so forth down to about 2000 m depth; see below for the schematic. There is increasing interest in [BGC-Argo](https://biogeochemical-argo.org/) floats that measure quantities relevant to biogeochemistry, and [deep Argo](https://argo.ucsd.edu/expansion/deep-argo-mission/) floats that go down to 4000 m. See OCES 3301 for more description.

<img src="https://argo.ucsd.edu/wp-content/uploads/sites/361/2020/06/float_cycle_1-768x424.png" width="600" alt='Argo'>

> NOTE: The namesake of argo is related to the [JASON](https://en.wikipedia.org/wiki/Jason-1) satellites if you know your Greek mythology.

The float data here are vertical sections at specific locations in space, and can be regarded as data that is more "raw" than the gridded data. The data provided here is in the `zarr` format, which can also be opened with `xarray`.

Some care needs to be taken to obtain a copy of this. Would highly recommend not loading this remotely, because the content is quite big.

> <span style="color:red">!!! NOTE !!!</span> (JM 15 Apr 2025): If you are on Colab, you probably need to mount and do a separate upload of the data.
> 1) Go to https://drive.google.com/drive/folders/1JJ0cpshu6-JE8wp93UsHuqy6V33rQy7s?usp=sharing
> 2) Download the folder
> 3) Upload that to your own instance of Colab
> 4) Mount with `from google.colab import drive; drive.mount('/content/drive')` and then proceed as below

In [None]:
# data is slightly out-of-date: it fails with the consolidated option switched on, but loads
# with consolidated=False, which also silences the warning
df = xr.open_zarr("./GLOB_HOMOGENEOUS_variables.zarr/", consolidated=False)
df

So note that the data is arranged as `(N_PROF, DEPTH)`, and `TIME`, `LATITUDE` and `LONGITUDE` are the variables tagged to `N_PROF`. Sample plot looks like this.

In [None]:
# plot out what the observation data actually looks like in geographical space

nl = 20 # change this index to plot different depths (as an index entry)

fig = plt.figure(figsize=(14, 4))

# temperature
ax = plt.subplot(1, 2, 1)
cs = ax.scatter(df.LONGITUDE, df.LATITUDE, 10, df.TEMP[:,nl], 
                cmap=plt.get_cmap('Spectral_r'), zorder=3)
ax.set_xlabel(r"lon ($^\circ$)")
ax.set_ylabel(r"lat ($^\circ$)")
plt.colorbar(cs)
ax.grid(lw=0.5, zorder=0)
ax.set_title(f"Temp at {df.DEPTH[nl].values} m")

# salinity
ax = plt.subplot(1, 2, 2)
cs = ax.scatter(df.LONGITUDE, df.LATITUDE, 2, df.PSAL[:,nl], 
                alpha=.5, cmap=plt.get_cmap('viridis', 5), zorder=3)
ax.set_xlabel(r"lon ($^\circ$)")
ax.set_ylabel(r"lat ($^\circ$)")
plt.colorbar(cs)
ax.grid(lw=0.5, zorder=0)
ax.set_title(f"Salinity at {df.DEPTH[nl].values} m")

In [None]:
# plot out TS-diagrams at different depths

nl = 0

fig = plt.figure(figsize=(10, 6))
ax = plt.subplot(1, 2, 1)
ax.plot(df.PSAL[:, nl], df.TEMP[:, nl], "o", markersize=2, label="total")
ax.grid()
plt.legend()
ax.set_ylabel(r'Temperature ($^\circ\ \mathrm{C}$)')
ax.set_xlabel(r'Salinity ($\mathrm{g}/\mathrm{kg}$)')
ax.set_title(f"TS diagram at {df.DEPTH[nl].values} m")

nl = 20

ax = plt.subplot(1, 2, 2)
ax.plot(df.PSAL[:, nl], df.TEMP[:, nl]  , "o", markersize=2, label="total")
ax.grid()
plt.legend()
ax.set_ylabel(r'Temperature ($^\circ\ \mathrm{C}$)')
ax.set_xlabel(r'Salinity ($\mathrm{g}/\mathrm{kg}$)')
ax.set_title(f"TS diagram at {df.DEPTH[nl].values} m");

> <span style="color:red">**Q.**</span> So from the TS diagram you notice that there are some outliers that probably should be removed (e.g. water that is too fresh is unlikely under typical oceanic conditions). Come up with criteria to drop these points from `df`, and do the plots again. This is an important part of data pre-processing before throwing it into the ML algorithms, following the ***Garbage In Garbage Out*** principle.
>
> <span style="color:red">**Q.**</span> Subset these by geographical locations (e.g. Atlantic Ocean using whatever defensible criterion you like), and either plot these separately, or plot them together with the labels.
>

We will come back to this dataset later (e.g. `02` for data cleaning and scaling, `04` for clustering, and later ones for doing interpolation and/or gap filling).
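
For the outlier question above, the basic mechanism is a boolean mask. Below is a minimal sketch on synthetic values: the arrays `temp` and `psal` are made up, two bad salinity points are planted by hand, and the plausibility window of 30 to 40 is an assumed criterion that you should replace with your own.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic temperature and salinity values, with two bad salinity points planted
temp = rng.uniform(-1.0, 30.0, 200)
psal = rng.uniform(33.0, 37.0, 200)
psal[[5, 50]] = [2.0, 80.0]

# assumed plausibility window for salinity; adjust to your own criterion
psal_min, psal_max = 30.0, 40.0
keep = (psal > psal_min) & (psal < psal_max) & np.isfinite(temp)

# apply the same mask to both arrays so the pairs stay aligned
temp_clean = temp[keep]
psal_clean = psal[keep]

print(f"kept {keep.sum()} of {keep.size} points")
```

The same idea carries over to `xarray` through `.where(condition)`, which masks entries failing the condition with `NaN`.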

---

## c) Images as numerical data

Just like we can visualise data as an image, we can sometimes go the other way and get data out of an image:

1) One possible example might be that you have chlorophyll concentration measurements, which give some shades of green in the image. Then a useful thing might be the reverse: measuring greenness from a satellite to infer the chlorophyll concentration.
2) I want to automatically classify species of fish or penguins or whatever from a long video segment.
3) I have broken images that I may want to fill out.

I am going to use some `jpg` files I have handy to demonstrate images as arrays.

In [None]:
option = "remote"

if option == "local":
    print("loading data locally (assumes file has already been downloaded)")
    file = "broccollie.jpeg"
elif option == "remote":
    # do a local caching (downloads a file to cache)
    print("loading data remotely")
    file_loc = "https://github.com/julianmak/OCES4303_ML_ocean/raw/refs/heads/main/broccollie.jpeg"
    file = fsspec.open_local(f"simplecache::{file_loc}", filecache={'cache_storage': '/tmp/fsspec_cache'})
else:
    raise ValueError("INVALID OPTION: use 'remote' or 'local'")

data = plt.imread(file)
ax = plt.axes()
ax.imshow(data)
ax.set_title(f"a broccollie of shape {data.shape}");

As can be seen from querying the loaded array, the array is of size `(pixels, pixels, RGB)` where `RGB` is the strength of (red, green, blue), and this goes from 0 to 255 for reasons you can look up if you want.

It's just an array so all the usual things can be done to it. The example below converts the RGB image to grayscale using the formula
\begin{equation*}
    \mathrm{gray} = 0.299 \mathrm{Red} + 0.587 \mathrm{Green} + 0.114 \mathrm{Blue}.
\end{equation*}
The formula assumes the RGB values lie between 0 and 1 so some conversion is needed, but that's easy (just divide by 255).
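
As a sanity check of the formula, here is a sketch applied to a made-up four pixel "image" (white, black, pure red, pure green). Since the three weights sum to one, white should map to 1 and black to 0.

```python
import numpy as np

# (red, green, blue) weights from the formula above
weights = np.array([0.299, 0.587, 0.114])

# 1 x 4 test "image": white, black, pure red, pure green, as 0 to 255 integers
pixels = np.array([[[255, 255, 255],
                    [0, 0, 0],
                    [255, 0, 0],
                    [0, 255, 0]]], dtype=np.uint8)

# normalise to [0, 1], then take the weighted sum over the last (RGB) axis
gray = (pixels / 255) @ weights

print(f"gray has shape {gray.shape}")
print(gray)
```

The `@` here does the same job as writing out the three terms by hand; it contracts the last axis of the image with the weights.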

> NOTE: For plotting you could of course just plot it as grayscale, but this is for demonstrating how to manipulate arrays.

In [None]:
# normalise data then convert
data_bw = data / 255
data_bw = 0.299 * data_bw[:, :, 0] + 0.587 * data_bw[:, :, 1] + 0.114 * data_bw[:, :, 2]

ax = plt.axes()
ax.imshow(data_bw, cmap="gray")
ax.set_title(f"a broccollie of shape {data_bw.shape}");

> <span style="color:red">**Q.**</span> Consider passing these through blurring or sharpening filters. `scipy` has a few, or you can do it from scratch by specifying the convolution or deconvolution kernels accordingly (cf. `08_times_series` and `10_fun_with_maps` in OCES 3301).

Below case is a stack of images that are written into a `csv` file, where one dimension denotes all the pixels, and the other dimension denotes the sample number.

In [None]:
option = "remote"

if option == "local":
    print("loading data locally (assumes file has already been downloaded)")
    path = "cat.csv"
elif option == "remote":
    print("loading data remotely")
    path = "https://raw.githubusercontent.com/julianmak/OCES4303_ML_ocean/refs/heads/main/cat.csv"
else:
    raise ValueError("INVALID OPTION: use 'remote' or 'local'")

# going to transpose this so the shape is (image number, pixels)
df = pd.read_csv(path, header=None).T
df

It turns out this is 80 images of cats, of 64 by 64 pixels each ($64^2 = 4096$). This one requires unflattening/reshaping the arrays for the images to make sense. For ML applications we actually feed flattened data into the algorithms; see later notebooks.
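
The flatten/unflatten round trip can be sketched on a toy example (a made-up 4 by 4 "image" instead of 64 by 64, with `samples` standing in for the stack of images):

```python
import numpy as np

n_side = 4
image = np.arange(n_side**2).reshape(n_side, n_side)  # toy 4 x 4 "image"

# flatten to a 1d array of pixels, then undo it
flat = image.flatten()
restored = flat.reshape(n_side, n_side)

# a stack of flattened images has shape (image number, pixels)
samples = np.stack([image.flatten(), (2 * image).flatten()])
second = np.reshape(samples[1, :], (n_side, n_side))

# other factorisations are allowed by numpy, but scramble the picture
wrong = flat.reshape(2, 8)

print(f"flat: {flat.shape}, restored: {restored.shape}, samples: {samples.shape}")
```

The reshape only recovers the image if the shape and the ordering match how the image was flattened in the first place, which is why the actual data may also need a transpose.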

In [None]:
# load data, reshape data and then plot one of the guys out

cats = df.values
ind = np.random.randint(cats.shape[0])

# transpose back for image display purposes
fig = plt.figure(figsize=(2, 2))
ax = plt.axes()
ax.imshow(np.reshape(cats[ind, :], (64, 64)).T, cmap="gray")
ax.set_title(f"cat {ind+1} / {cats.shape[0]}");

> <span style="color:red">**Q.**</span> As a python exercise, try and use a loop to plot five of these but randomly choosing the image number to plot. Make sure the randomly selected indices are distinct.
>
> <span style="color:red">**Q.**</span> You can try reshaping it in different ways (e.g. instead of `.reshape(64, 64)` try other numbers that multiply to `4096`), and convince yourself the choice taken here is the only sensible one.
>
Point here is that if you can deal with these sample images you can in principle deal with other images (e.g. fish/coral/snails/satellite/remote sensing images).

----------------
# More involved exercises with this notebook

## 1) Delay embedding (Takens' embedding)

Taking the sine curve example, consider doing the plots like the right hand side, but instead do e.g.
\begin{equation*}
    [f_1, f_2, f_3, f_4, \ldots, f_N]
\end{equation*}
against
\begin{equation*}
    [f_0, f_1, f_2, f_3, \ldots, f_{N-1}],
\end{equation*}
i.e. shift the array by one index (or more if you want). You will need to be careful about array sizes and calling array entries with indices less than zero (this will lead to wrap around, e.g. `f_{-1} = f_{N}` and `f_{-2} = f_{N-1}`; the sine curve example is periodic so it doesn't matter, but it might do for more general cases).

You may want to consider writing this as a subroutine that takes in an array and spits out two arrays, one with a shifted index, and both having the correct size.

What you should get are ellipses with different eccentricities depending on the time-lag. Convince yourself that makes sense.

The above is related to [Takens' theorem](https://en.wikipedia.org/wiki/Takens%27s_theorem) and we may or may not come back to this in the bonus lectures, e.g. [Empirical Dynamic Modelling (EDM)](https://en.wikipedia.org/wiki/Empirical_dynamic_modeling), [Topological Data Analysis (TDA)](https://en.wikipedia.org/wiki/Topological_data_analysis).
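
One possible shape for such a subroutine is sketched below. The name `lag_pair` is made up, and using slicing (rather than something like `np.roll`) is a design choice to avoid the wrap around issue mentioned above.

```python
import numpy as np

def lag_pair(f, lag=1):
    """Return (f[lag:], f[:N - lag]), two arrays of the same length.

    Hypothetical helper: the first output is the shifted array, the second is
    the original truncated to match. Slicing avoids any wrap around.
    """
    if lag < 0 or lag >= len(f):
        raise ValueError("lag must satisfy 0 <= lag < len(f)")
    return f[lag:], f[:len(f) - lag]

t_vec = np.linspace(0, 2.0 * np.pi, 61)
f = np.sin(t_vec)

f_lead, f_lag = lag_pair(f, lag=5)
print(f"original shape {f.shape}, lagged pair shapes {f_lead.shape} and {f_lag.shape}")
```

The two outputs can then go straight into a scatter plot, e.g. `ax.scatter(f_lag, f_lead)`, for different choices of `lag`.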

In [None]:
# baseline arrays for doing lag embedding with
t_vec = np.linspace(0, 2.0 * np.pi, 61)
f     = np.sin(t_vec)

## 2) El Nino data manipulation

Starting from the El Nino 3.4 time series data, probably detrend it to get the anomalies. Provide a threshold criterion to classify El Nino and La Nina events, analogous to e.g. https://ggweather.com/enso/oni.htm. You may or may not want to compute some smoothing / running averages.

Could also compute the power spectrum or similar to obtain the magnitude of frequencies.
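
A possible starting point for the running average and thresholding is sketched below on a synthetic anomaly series (`anom` is made up and is not the real Nino 3.4 record). The 3 month window and the thresholds of plus or minus 0.5 degrees are assumptions in the spirit of the ONI; the ONI additionally asks for the threshold to be exceeded over several consecutive overlapping seasons, which is NOT implemented here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_months = 120

# synthetic monthly anomalies in deg C: a slow oscillation plus noise
anom = (1.2 * np.sin(2.0 * np.pi * np.arange(n_months) / 48.0)
        + 0.2 * rng.standard_normal(n_months))

# 3 month running mean via convolution; mode="valid" drops the two end points
window = 3
smooth = np.convolve(anom, np.ones(window) / window, mode="valid")

# assumed thresholds of +/- 0.5 deg C on the smoothed anomalies
threshold = 0.5
labels = np.where(smooth >= threshold, "El Nino",
                  np.where(smooth <= -threshold, "La Nina", "neutral"))

for name in ["El Nino", "La Nina", "neutral"]:
    print(f"{name}: {np.sum(labels == name)} months")
```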

## 3) Turtle + penguin data

#### (Probably relevant to one of the assignments.) 

Have a look and see what is in this dataset: https://www.kaggle.com/datasets/abbymorgan/penguins-vs-turtles. This may be used for one of the marked assignments.

## 4) Satellite data

#### (May be relevant to one of the assignments.) 

Obtain some satellite grid and/or track data. Would recommend having a look at the [Sentinel Data Space](https://dataspace.copernicus.eu/data-collections/copernicus-sentinel-data); we may also use this in one of the worked examples and/or marked assignments.