# 1. Model Training

This notebook trains two sea level prediction models: a baseline linear model and a non-linear model with two hidden layers. It performs the following steps:
1. Loads and preprocesses the historical GHG and sea level data.
2. Splits the data into training, validation, and test sets using a **chronological** split, which is appropriate for time-series data.
3. Initializes two models: a baseline linear model and a 2-hidden-layer non-linear model.
4. Trains both models on the training data.
5. Saves the trained model objects to the `../models/` directory for later use in analysis and prediction.

### 1.1 Setup and Data Preparation

We start from the processed data produced in the `data_exploration` notebook and normalize it to zero mean and unit variance. Each training example pairs a multi-year window of GHG emissions with the sea level value of the year that follows the window. We chose an initial window of 15 years because sea level responds to emissions with a delay: emissions accumulate and raise global temperature, which in turn drives slow processes such as the melting of ice sheets and glaciers over decades.

Because the models are meant to predict future sea level from projected GHG emission scenarios, we do not shuffle the data. Instead, we split it chronologically into training (target years 1970 to 2000), validation (2001 to 2007), and test (2008 to 2014) sets, which is roughly a 70% / 15% / 15% split. Every validation and test target therefore lies later in time than every training target, which avoids leaking future sea level values into training and gives a more realistic estimate of predictive performance. One caveat: the normalization statistics are computed over the full historical series, so the later years still influence the scaling of the training data.
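
The windowing and chronological split described above can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual pipeline: the series `ghg` and `sea` and the helper `make_windows` are hypothetical stand-ins, and the year ranges are chosen to match the split used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic yearly series standing in for the normalized GHG and sea level data.
years = np.arange(1955, 2015)
ghg = pd.Series(np.linspace(-1.5, 1.5, len(years)), index=years)
sea = pd.Series(np.linspace(-1.2, 1.8, len(years)), index=years)

def make_windows(n_years, first_target, last_target):
    """Pair each target year with the n_years of emissions that precede it."""
    X, y = [], []
    for target in range(first_target, last_target + 1):
        # Label-based slice: both the start and end year are included.
        window = ghg.loc[target - n_years:target - 1]
        X.append(window.to_numpy())
        y.append(sea.loc[target])
    return np.array(X), np.array(y)

# Chronological split: no shuffling, later years are held out.
X_train, y_train = make_windows(15, 1970, 2000)
X_val, y_val = make_windows(15, 2001, 2007)
X_test, y_test = make_windows(15, 2008, 2014)

total = len(X_train) + len(X_val) + len(X_test)
for name, X in [("train", X_train), ("val", X_val), ("test", X_test)]:
    print(f"{name}: {X.shape}, {len(X) / total:.0%} of samples")
```

Only the target years are disjoint across the three sets. The input windows of neighbouring sets overlap, for example the window for target year 2001 covers emissions from 1986 to 2000, which is expected for this kind of sliding-window setup.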

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
import os

In [None]:
%%capture
from ipynb.fs.full.data_exploration import df_sealevel, GHG_past_comb, df_pred

In [None]:
# Add src directory to path to import neural_networks module
sys.path.append(os.path.abspath(os.path.join('..', 'src')))
from neural_networks import NeuralNetwork_0hl, NeuralNetwork_2hl

# Set random seed for reproducibility
np.random.seed(42)

# Create models directory if it doesn't exist
os.makedirs('../models', exist_ok=True)

# Prepare the data for normalization
df_pred = df_pred.rename(columns={
    'Trend from implemented policies (Lowest bound of  red shading ) ':
        'Trend from implemented policies',
    'Limit warming to 2°C (>67%) or return warming to 1.5°C (>50%) after a high overshoot, NDCs until 2030 (Median , dark navy blue line )':
        'Limit warming to 2°C or return warming to 1.5°C after a high overshoot',
    'Limit warming to 2°C (>67%) (Median , dark green line )':
        'Limit warming to 2°C',
    'Limit warming to 1.5°C (>50%) with no or limited overshoot ( Median ligh blue line ) ':
        'Limit warming to 1.5°C',
})
df_sealevel = df_sealevel.groupby(df_sealevel.Day.dt.year).mean()
df_sealevel = df_sealevel.drop('Day', axis=1, errors='ignore')

# Normalization
GHG_past_norm = (GHG_past_comb - GHG_past_comb.mean()) / GHG_past_comb.std()
sealevel_norm = (df_sealevel - df_sealevel.mean()) / df_sealevel.std()

# Sequence and Splitting
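def get_GHG_sequence(n_years, df_GHG, start_year, end_year):
    """Pair each target year with the n_years of GHG data that precede it.

    Targets are read from the global sealevel_norm, indexed by year.
    """
    X, y = [], []
    for i in range(start_year, end_year + 1):
        # Input window: the n_years ending the year before target year i
        end_ix = i - 1
        start_ix = end_ix - n_years + 1
        seq_x = df_GHG.loc[start_ix:end_ix]  # .loc label slices include both ends
        X.append(seq_x.to_numpy())
        y.append(sealevel_norm.loc[i].values)
    return np.array(X), np.array(y)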

timespan = 15
train_end_year = 2000
validation_end_year = 2007
test_end_year = 2014

X_train, y_train = get_GHG_sequence(timespan, GHG_past_norm, 1970, train_end_year)
X_val, y_val = get_GHG_sequence(timespan, GHG_past_norm, train_end_year + 1, validation_end_year)
X_test, y_test = get_GHG_sequence(timespan, GHG_past_norm, validation_end_year + 1, test_end_year)

print(f'Training set size: {len(X_train)}')
print(f'Validation set size: {len(X_val)}')
print(f'Test set size: {len(X_test)}')

Training set size: 31
Validation set size: 7
Test set size: 7
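
Since both models are trained on z-scored targets, their predictions come out in normalized units and need to be mapped back before they can be read as sea level values. The sketch below shows the round trip on made-up numbers. The frame `raw` and its column `sea_level_mm` are hypothetical and only illustrate the arithmetic; in this notebook the corresponding statistics would be `df_sealevel.mean()` and `df_sealevel.std()`.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly values; the column name and numbers are invented for illustration.
raw = pd.DataFrame(
    {"sea_level_mm": [0.0, 3.1, 6.4, 9.2, 12.8]},
    index=[2010, 2011, 2012, 2013, 2014],
)

# Same z-score formula as in the cell above.
# Note: pandas computes the sample standard deviation (ddof=1) by default.
mean, std = raw.mean(), raw.std()
normalized = (raw - mean) / std

# Inverse transform: x = z * std + mean
restored = normalized * std + mean

print(np.allclose(restored.to_numpy(), raw.to_numpy()))
```

The same `mean` and `std` used for normalization must be kept around for the inverse step, so it can be convenient to store them next to the saved models.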


### 1.2 Train Baseline Model (Linear)

In [None]:
print('--- Training Baseline Model ---')
nn_base = NeuralNetwork_0hl(input_size=timespan, output_size=1)
mse_base_train = nn_base.train(np.squeeze(X_train), y_train, epochs=20000, learningrate=0.001, print_output=False)
print('Training complete.')

# Save the model
nn_base.save_model('../models/baseline_model.pkl')
print('Baseline model saved to ../models/baseline_model.pkl')

--- Training Baseline Model ---
Training complete.
Baseline model saved to ../models/baseline_model.pkl
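
The `NeuralNetwork_0hl` class lives in `src/neural_networks.py` and is not shown here. As a rough mental model only, a network without hidden layers is a linear map trained by gradient descent. The sketch below assumes a mean squared error loss and full-batch updates, which may differ from the actual class, and it uses synthetic data with the same shape as `np.squeeze(X_train)`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data: 31 samples with 15 features each.
X = rng.normal(size=(31, 15))
true_w = rng.normal(size=(15, 1))
y = X @ true_w + 0.5

# A network without hidden layers is a linear map: y_hat = X @ W + b
W = np.zeros((15, 1))
b = np.zeros((1,))
learning_rate = 0.001

def mse(pred, target):
    """Mean squared error between predictions and targets."""
    return float(np.mean((pred - target) ** 2))

initial_mse = mse(X @ W + b, y)
for _ in range(20000):
    error = (X @ W + b) - y
    # Gradients of the mean squared error with respect to W and b
    grad_W = 2.0 * X.T @ error / len(X)
    grad_b = 2.0 * error.mean(axis=0)
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

final_mse = mse(X @ W + b, y)
print(f"MSE before training: {initial_mse:.4f}, after: {final_mse:.6f}")
```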


### 1.3 Train 2-Hidden-Layer Model (Non-Linear)

In [None]:
print('--- Training 2-Hidden-Layer Model ---')
nn_2hl = NeuralNetwork_2hl(input_size=timespan, hidden_size1=8, hidden_size2=4, output_size=1)
mse_2hl_train = nn_2hl.train(np.squeeze(X_train), y_train, epochs=20000, learningrate=0.001, print_output=False)
print('Training complete.')

# Save the model
nn_2hl.save_model('../models/2hl_model.pkl')
print('2-layer model saved to ../models/2hl_model.pkl')

--- Training 2-Hidden-Layer Model ---
Training complete.
2-layer model saved to ../models/2hl_model.pkl
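
To get a feel for the size of the 15-8-4-1 architecture configured above, the sketch below builds a network with those layer sizes and runs one forward pass. The `tanh` activation, the bias terms, and the weight initialization are assumptions made for illustration; the actual `NeuralNetwork_2hl` implementation in `src/neural_networks.py` may differ.

```python
import numpy as np

rng = np.random.default_rng(42)
layer_sizes = [15, 8, 4, 1]  # input_size, hidden_size1, hidden_size2, output_size

# One (weights, bias) pair per layer; small random weights, zero biases.
params = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

def forward(x, params):
    """Forward pass: tanh on hidden layers (assumed), linear output for regression."""
    a = x
    for W, b in params[:-1]:
        a = np.tanh(a @ W + b)
    W_out, b_out = params[-1]
    return a @ W_out + b_out

n_params = sum(W.size + b.size for W, b in params)
batch = rng.normal(size=(31, 15))  # same shape as the squeezed training inputs

print(forward(batch, params).shape)  # (31, 1)
print(n_params)                      # 169
```

Under these assumptions the network has 169 trainable parameters, compared with 31 training samples, so overfitting is a real possibility and the validation set is the place to check for it.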
