## Check and prepare data for algae bloom identification


The idea is to train a machine learning (ML) model on modelled data and then apply it to observational data.
The reason: numerical models cannot predict the exact time and location of algae blooms very well, but they do reproduce the underlying physics and biogeochemistry.
The hypothesis to check is therefore that an ML model trained on modelled data will be able to predict blooms in observational data.

In [None]:
import glob
from pathlib import Path

import numpy as np  # noqa: F401
import pandas as pd  # noqa: F401
import xarray as xr
import matplotlib.pyplot as plt  # noqa: F401

from blooms_ml.utils import (
    extract_stations_rho,
    extract_stations_u,
    extract_stations_v,
    merge_edges_to_centers,
    append_rho_profiles,
    plot_variable,
)

This is the output of a coupled hydrophysical and biogeochemical model of the Hardangerfjord, stored on the FRAM HPC system.
The files are too large to download, so I have mounted the data folder and read them from there.
The output is based on the ROMS hydrophysical and NERSEM biogeochemical models.
Diagnostic ('dia') files contain PAR (photosynthetically active radiation).
Average ('avg') files contain the rest of the variables.

In [None]:
data_dir = Path.home() / "fram_shmiak" / "ROHO800_hindcast_2007_2019_v2bu"
files_dia = sorted(glob.glob(str(data_dir / "roho800_v2bu_dia" / "*dia*.nc")))
files_avg = sorted(glob.glob(str(data_dir / "roho800_v2bu_avg" / "*avg*.nc")))
# Only the last few files of each sorted list are opened below.
last_n_files = 3

In [None]:
ds_dia = xr.open_mfdataset(files_dia[-last_n_files:])

In [None]:
st_labels = ['1', '2', '3', '4']
xis = [100, 120, 140, 160]
etas = [100, 110, 120, 130]

In [None]:
# Plot the land/sea mask and mark the chosen stations on top of it.
p = ds_dia.mask_rho.isel(ocean_time=-1).plot(
    x="xi_rho", y="eta_rho", figsize=(14, 7), cmap='GnBu'
)
p.axes.scatter(x=xis, y=etas, color='red')
for label, xi, eta in zip(st_labels, xis, etas):
    p.axes.annotate(label, (xi, eta), color='red')

Extract light (light_PAR0) and P1_netPI at the chosen points.

In [None]:
ds = extract_stations_rho(ds_dia, xis, etas)
ds = merge_edges_to_centers(ds)
df_dia = ds[['light_PAR0', 'P1_netPI']].to_dataframe()
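The internals of `extract_stations_rho` are not shown in this notebook. As a rough sketch of how point extraction on the rho grid can be done with plain xarray, the toy example below uses pointwise (vectorized) indexing. The toy field, its size, and the index values are assumptions made for illustration only.

```python
import numpy as np
import xarray as xr

# Toy stand-in for a model field on the rho grid (time, eta_rho, xi_rho).
field = xr.DataArray(
    np.arange(2 * 4 * 5).reshape(2, 4, 5),
    dims=("ocean_time", "eta_rho", "xi_rho"),
    name="light_PAR0",
)

# Grid indices of the points to sample (hypothetical values for the toy grid).
xi_idx = xr.DataArray([1, 3], dims="station")
eta_idx = xr.DataArray([0, 2], dims="station")

# Indexers that share the "station" dimension are applied pointwise,
# so this picks (eta=0, xi=1) and (eta=2, xi=3), not a 2x2 block.
stations = field.isel(xi_rho=xi_idx, eta_rho=eta_idx)
print(stations.dims, stations.shape)
```

Passing plain Python lists to `isel` instead would select the full block of all eta/xi combinations, which is usually not what is wanted for a set of stations.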

Extract the other variables.
There are too many of them, so keep only a subset.

In [None]:
vars = ['lat_rho', 'lon_rho', 'ocean_time', 's_rho',
        'TotChl', 'P1_c',
        'swradWm2',
        'rho', 'temp', 'salt', 'AKv', 'u', 'v', 'w',
        'N1_p', 'N3_n', 'N5_s', 'O2_o']

In [None]:
ds_avg = xr.open_mfdataset(files_avg[-last_n_files:])
# Extract the stations on each staggered grid (rho, u, v) and keep only that grid's dimensions.
ds_rho = extract_stations_rho(ds_avg, xis, etas)
ds_rho = ds_rho.drop_dims(['eta_u', 'eta_v', 'eta_psi', 'xi_u', 'xi_v', 'xi_psi'])
ds_u = extract_stations_u(ds_avg, xis, etas)
ds_u = ds_u.drop_dims(['eta_rho', 'eta_v', 'eta_psi', 'xi_rho', 'xi_v', 'xi_psi'])
ds_v = extract_stations_v(ds_avg, xis, etas)
ds_v = ds_v.drop_dims(['eta_rho', 'eta_u', 'eta_psi', 'xi_rho', 'xi_u', 'xi_psi'])
ds = xr.merge([ds_rho, ds_u, ds_v])

In [None]:
ds = merge_edges_to_centers(ds)
ds_subset = ds.drop_vars([var for var in ds.variables if var not in vars])
df = ds_subset.to_dataframe()
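In ROMS output some variables, such as `w` and `AKv`, are defined on layer interfaces (`s_w`), while tracers are defined on layer centers (`s_rho`). The helper `merge_edges_to_centers` presumably reconciles the two so that everything fits in one table. One common way to do that is to average neighbouring interface values, as in the sketch below. This is an assumption for illustration, not the actual `blooms_ml` code.

```python
import numpy as np


def edges_to_centers(values_on_edges):
    """Average neighbouring layer-edge values to get one value per layer center."""
    values_on_edges = np.asarray(values_on_edges, dtype=float)
    return 0.5 * (values_on_edges[:-1] + values_on_edges[1:])


# Four interfaces bound three layers (toy values).
w_edges = np.array([0.0, 0.2, 0.6, 1.0])
w_centers = edges_to_centers(w_edges)
print(w_centers)
```

The result always has one element fewer than the input, which matches the relation between the number of interfaces and the number of layers.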

In [None]:
df['light_PAR0'] = df_dia['light_PAR0']
df['P1_netPI'] = df_dia['P1_netPI']
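The two assignments above work because pandas aligns a Series to the target frame by index label, not by row position. They therefore assume that `df` and `df_dia` share the same index. The toy frames below (hypothetical names and values) show that alignment explicitly.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[0, 1], [-0.75, -0.25]], names=["station", "s_rho"]
)
df_main = pd.DataFrame({"temp": [8.0, 9.0, 7.5, 8.5]}, index=index)

# Same labels, deliberately stored in reverse row order.
df_light = pd.DataFrame({"light_PAR0": [1.0, 2.0, 3.0, 4.0]}, index=index).iloc[::-1]

# Assignment matches rows by index label, so the reversed order does not matter.
df_main["light_PAR0"] = df_light["light_PAR0"]
print(df_main)
```

Labels present in `df_main` but missing from the assigned Series would end up as NaN, so a quick `isna().sum()` after such an assignment is a cheap sanity check.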

In [None]:
df

Visualization

In [None]:
df_station = df.loc[df.index.get_level_values('station') == 3]
df_station = df_station.reset_index()

In [None]:
df_station

In [None]:
plot_variable(df_station.pivot(index='s_rho', columns='ocean_time', values='light_PAR0').iloc[::-1])

In [None]:
plot_variable(df_station.pivot(index='s_rho', columns='ocean_time', values='P1_c').iloc[::-1])

In [None]:
plot_variable(df_station.pivot(index='s_rho', columns='ocean_time', values='P1_netPI').iloc[::-1])

In [None]:
plot_variable(df_station.pivot(index='s_rho', columns='ocean_time', values='w').iloc[::-1])

In [None]:
plot_variable(df_station.pivot(index='s_rho', columns='ocean_time', values='rho').iloc[::-1])

In [None]:
plot_variable(df_station.pivot(index='s_rho', columns='ocean_time', values='N1_p').iloc[::-1])

In [None]:
plot_variable(df_station.pivot(index='s_rho', columns='ocean_time', values='N3_n').iloc[::-1])
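The `pivot` calls above reshape the long station table into a depth by time grid before plotting. A minimal sketch with made-up numbers shows what that reshaping and the final `.iloc[::-1]` do.

```python
import pandas as pd

# Toy long-format table: one row per (time, depth level).
long_table = pd.DataFrame({
    "ocean_time": pd.to_datetime(["2019-06-01"] * 3 + ["2019-06-02"] * 3),
    "s_rho": [-0.8, -0.5, -0.2] * 2,
    "P1_c": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
})

# Rows become depth levels, columns become time steps.
grid = long_table.pivot(index="s_rho", columns="ocean_time", values="P1_c")

# pivot sorts s_rho ascending (deepest level first); reversing puts the surface on top.
grid_surface_first = grid.iloc[::-1]
print(grid_surface_first)
```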

Data preprocessing.
Using the equation of state, density can be recovered from temperature and salinity, so rho is kept as a feature while temp and salt are dropped.
Then extract and append the rho profiles.
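To make the link between temperature, salinity, and density concrete, here is a sketch of a simplified linear equation of state. Real ocean models typically use a more elaborate nonlinear formulation, and the reference values and coefficients below are illustrative assumptions, not values taken from this model setup.

```python
import numpy as np


def linear_density(temp, salt, rho0=1027.0, temp0=10.0, salt0=35.0,
                   alpha=1.7e-4, beta=7.6e-4):
    """Simplified linear equation of state (illustrative coefficients).

    rho = rho0 * (1 - alpha * (T - T0) + beta * (S - S0))
    """
    temp = np.asarray(temp, dtype=float)
    salt = np.asarray(salt, dtype=float)
    return rho0 * (1.0 - alpha * (temp - temp0) + beta * (salt - salt0))


rho_ref = linear_density(10.0, 35.0)    # at the reference state
rho_warm = linear_density(15.0, 35.0)   # warmer water is lighter
rho_salty = linear_density(10.0, 36.0)  # saltier water is denser
print(rho_ref, rho_warm, rho_salty)
```

Because density is a function of temperature and salinity, keeping all three as features would add largely redundant information.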

In [None]:
df_input = df.drop(columns=['lon_rho', 'lat_rho', 'temp', 'salt', 'u', 'v', 'O2_o', 'AKv'])
df_input

In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()

In [None]:
# fit_transform expects a 2D array; flatten the result back to one column.
df_input['rho'] = scaler.fit_transform(df_input['rho'].values.reshape(-1, 1)).ravel()
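With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range, which makes it less sensitive to outliers than mean/standard-deviation scaling. The sketch below checks this against a manual computation on toy density values (the numbers are made up, including one deliberate outlier).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one outlier; shape (n_samples, 1) as scikit-learn expects.
values = np.array([1020.0, 1022.0, 1024.0, 1026.0, 1100.0]).reshape(-1, 1)

scaled = RobustScaler().fit_transform(values)

# Manual equivalent of the default settings: center on the median, scale by the IQR.
median = np.median(values)
q25, q75 = np.percentile(values, [25, 75])
manual = (values - median) / (q75 - q25)

print(np.allclose(scaled, manual))
```

The outlier still maps to a large value, but it does not distort the scaling of the other points the way it would with a mean-based scaler.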

In [None]:
df_input = append_rho_profiles(df_input)
df_input
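The implementation of `append_rho_profiles` is not shown here. One plausible reading, given that the result is a flat table with extra columns, is that every row receives the whole density profile of its station and time step. The function below is a hypothetical stand-in written under that assumption. Its name and the `rho_0`, `rho_1`, ... column names are invented for illustration.

```python
import numpy as np
import pandas as pd


def append_profiles_sketch(df, value="rho"):
    """Hypothetical stand-in: attach the full vertical profile to every row."""
    flat = df.reset_index()
    # One row per (station, time), one column per depth level.
    profiles = flat.pivot(index=["station", "ocean_time"], columns="s_rho", values=value)
    profiles.columns = [f"{value}_{i}" for i in range(profiles.shape[1])]
    # Broadcast each profile back to all depth levels of its station and time.
    return flat.merge(profiles.reset_index(), on=["station", "ocean_time"], how="left")


index = pd.MultiIndex.from_product(
    [[0, 1], pd.to_datetime(["2019-06-01", "2019-06-02"]), [-0.75, -0.25]],
    names=["station", "ocean_time", "s_rho"],
)
demo = pd.DataFrame({"rho": np.arange(8, dtype=float)}, index=index)
out = append_profiles_sketch(demo)
print(out)
```

With this layout each sample carries information about the stratification of the whole water column, not just the density at its own depth.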

In [None]:
# No light means no blooms: drop the dark points, which also helps with scaling and sampling.
df_input = df_input[df_input['light_PAR0'] > 10].reset_index(drop=True)
df_input

Scaling

In [None]:
df_input = df_input.drop(columns=['ocean_time', 's_rho'])
df_input

In [None]:
df_input.iloc[:, :8] = scaler.fit_transform(df_input.iloc[:, :8].values)
df_input

In [None]:
df_input.describe()