# Baseline Model

## Table of Contents
1. [Model Choice](#model-choice)
2. [Feature Selection](#feature-selection)
3. [Implementation](#implementation)
4. [Evaluation](#evaluation)


In [None]:
# Importing the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error



## Model Choice

The baseline model is a simple sequential model with two layers. The first layer has 32 neurons with a Rectified Linear Unit (ReLU) activation function and takes input with a shape of (3,). The second layer has a single neuron, making it suitable for regression tasks, and it employs a linear activation function. 


This model is designed to predict species richness based on three input features. The architecture, particularly the number of layers, units, and activation functions, can be further adjusted depending on the specific requirements and characteristics of the dataset.
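
As a quick sanity check on the architecture described above, the number of trainable parameters in a dense layer is inputs × units (weights) plus units (biases). The sketch below works this out for the two layers; the helper name `dense_params` is hypothetical and introduced only for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

hidden = dense_params(3, 32)   # 3 input features -> 32 ReLU units
output = dense_params(32, 1)   # 32 hidden units -> 1 linear output
total = hidden + output

print(hidden, output, total)   # 128 33 161
```

With only 161 parameters, the baseline is small enough to train quickly, which is part of what makes it a reasonable reference point for later models.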

## Feature Selection
Random seeds are fixed so that model training is reproducible.


Based on an analysis of the correlation matrix, the following features and target variable are selected for the regression:

features = 'Age', 'elevation', 'Mean_Temperature'

target = 'SR_total'
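
The correlation analysis itself is not shown in this section. As a rough sketch of the idea, the example below ranks candidate columns by the absolute Pearson correlation with the target. All data here is synthetic, and the `noise` column and the coefficients are assumptions made purely for illustration, not properties of the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for the real columns (assumption: not the actual data)
age = rng.normal(size=n)
elevation = rng.normal(size=n)
noise = rng.normal(size=n)  # hypothetical uninformative column
sr_total = 2.0 * age - 1.0 * elevation + 0.1 * rng.normal(size=n)

columns = {"Age": age, "elevation": elevation, "noise": noise}

# Absolute Pearson correlation of each candidate column with the target
abs_corr = {name: abs(np.corrcoef(col, sr_total)[0, 1])
            for name, col in columns.items()}

# Strongest candidates first
ranked = sorted(abs_corr, key=abs_corr.get, reverse=True)
print(ranked)
```

On a pandas DataFrame, `DataFrame.corr()` returns the full correlation matrix in one call, which is the more likely route for a dataset such as `maxprob`.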

In [None]:
# Feature selection

# Fix the random seeds (so that weight initialization and shuffling are reproducible)
SEED = 42
# Set seed for NumPy
np.random.seed(SEED)
# Set seed for TensorFlow
tf.random.set_seed(SEED)


# Setting features and target variable (Age, elevation and Mean_Temperature are used to predict SR_total)
features = maxprob[['Age', 'elevation', 'Mean_Temperature']]
target = maxprob['SR_total']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

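# Normalize the data (to bring the features onto comparable numerical scales).
# The mean and standard deviation come from the training set only and are
# reused for the test set, so both splits are scaled identically.
train_mean = X_train.mean()
train_std = X_train.std()

def normalize(data):
    return (data - train_mean) / train_std

X_train = normalize(X_train)
X_test = normalize(X_test)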
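
Standardizing a held-out split with its own statistics and standardizing it with the training statistics give different results. The toy arrays below are made up to show the effect: when the test values are shifted relative to the training values, per-split statistics erase that shift, while training statistics preserve it.

```python
import numpy as np

train = np.array([10.0, 20.0, 30.0])
test = np.array([110.0, 120.0, 130.0])  # shifted far from the training range

# ddof=1 matches the default of pandas' .std()
mu, sigma = train.mean(), train.std(ddof=1)

# Scaled with the test split's own mean and standard deviation
test_own_stats = (test - test.mean()) / test.std(ddof=1)

# Scaled with the training mean and standard deviation
test_train_stats = (test - mu) / sigma

print(test_own_stats)    # values -1, 0, 1: the shift is erased
print(test_train_stats)  # values 9, 10, 11: the shift is preserved
```

A model trained on inputs scaled with `mu` and `sigma` expects the same mapping at prediction time, which is why the training statistics are usually reused.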


## Implementation

The loss function is Mean Squared Error (MSE):
- MSE is differentiable, which the optimization algorithm needs in order to
perform gradient descent during training.
- Minimizing MSE during training minimizes the squared differences
between predicted and actual values, which is a common optimization objective in regression
tasks.
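
To make both points concrete, the following minimal sketch computes MSE and its gradient with respect to the predictions by hand. The numbers are invented for illustration.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 8.0])
y_pred = np.array([2.0, 5.0, 10.0])

errors = y_pred - y_true

# Mean of the squared errors: (1 + 0 + 4) / 3
mse = np.mean(errors ** 2)

# Gradient of MSE with respect to each prediction: 2 * error / n.
# It is defined for every value of the error, which is what gradient descent needs.
grad = 2.0 * errors / errors.size

print(mse)
print(grad)
```

The gradient is proportional to the error, so larger mistakes produce larger updates.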




The evaluation metric is R squared:

- Our primary goal is a model that explains, and therefore predicts, as large a share as possible of the variance in the target variable.
- R squared represents the proportion of the dependent variable's variance captured by the model.
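
A small worked example may help here. The sketch below computes R squared as 1 − SS_res / SS_tot on invented numbers; the helper `r2` is hypothetical. It also shows that averaging R squared over mini-batches, which is how a metric function passed to Keras is typically aggregated, can differ from R squared computed over the whole dataset.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 4.0, 5.0, 8.0]

# Whole dataset: SS_res = 2, SS_tot = 20, so R squared = 0.9
full = r2(y_true, y_pred)

# Always predicting the mean of the targets gives R squared = 0
mean_only = r2(y_true, [5.0] * 4)

# Two batches of size 2, each with R squared = 0.5, averaged
batched = np.mean([r2(y_true[:2], y_pred[:2]),
                   r2(y_true[2:], y_pred[2:])])

print(full, mean_only, batched)
```

Because of this difference, the value reported during training is best read as a rough indicator, and a whole-dataset computation such as scikit-learn's `r2_score` is the safer number to report.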

In [None]:

model_baseline = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(3,)),
    tf.keras.layers.Dense(1)  # Output layer with one neuron for species richness prediction
])

# Custom R-squared metric (computed on each batch during training and evaluation)
def r_squared(y_true, y_pred):
    SS_res = tf.reduce_sum(tf.square(y_true - y_pred))
    SS_tot = tf.reduce_sum(tf.square(y_true - tf.reduce_mean(y_true)))
    # epsilon guards against division by zero when all targets in a batch are equal
    return 1 - (SS_res / (SS_tot + tf.keras.backend.epsilon()))

# Model compilation
model_baseline.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss='mean_squared_error', metrics=[r_squared])

# Train the model
history = model_baseline.fit(X_train, y_train, epochs=40, batch_size= 32, validation_split=0.2, verbose=1)


# Check visually whether the model is overfitting
plt.figure(figsize=(10, 5))

# Plot training and validation loss per epoch
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()
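
Besides reading the plot, the same curves can be summarized numerically. The sketch below uses invented loss values in a dictionary shaped like `history.history`; it finds the epoch with the lowest validation loss and the final gap between validation and training loss. A validation loss that rises while the training loss keeps falling is a common sign of overfitting.

```python
# Hypothetical loss curves, shaped like history.history (values are invented)
hist = {
    "loss":     [9.0, 6.0, 4.0, 3.0, 2.5, 2.2],
    "val_loss": [9.5, 6.5, 5.0, 4.8, 5.1, 5.6],
}

val = hist["val_loss"]

# Zero-based index of the epoch with the lowest validation loss
best_epoch = min(range(len(val)), key=val.__getitem__)

# Gap between validation and training loss at the last epoch
final_gap = val[-1] - hist["loss"][-1]

# True if validation loss got worse after its best epoch
val_loss_rising = val[-1] > val[best_epoch]

print(best_epoch, round(final_gap, 2), val_loss_rising)
```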



## Evaluation



In [None]:

# Evaluate the baseline model on test data.
# evaluate() returns the loss (MSE) followed by the compiled metric (r_squared), not MAE
loss, r2_keras = model_baseline.evaluate(X_test, y_test, verbose=0)


from sklearn.metrics import mean_squared_error, r2_score

# Predictions from the baseline model on the test set
predictions_baseline = model_baseline.predict(X_test)

# Calculate additional metrics
mse = mean_squared_error(y_test, predictions_baseline)
# rmse = np.sqrt(mse)
# mae = mean_absolute_error(y_test, predictions_baseline)
r2 = r2_score(y_test, predictions_baseline)

print(f'MSE: {mse}')
print(f'Baseline model R-squared: {r2}')