# Species Distribution Model (SDM)

#### Intro
As marine ecosystems undergo global climate change, there is an increasing need to incorporate potential shifts in the distribution of marine taxa into management plans.

We propose that the best models (those that effectively describe or predict marine animal distribution patterns at a desired temporal scale without utilizing unnecessarily high-resolution data) are obtained when the temporal characteristics of the animals' distribution and environmental data sufficiently match the scale of the ecological question and the variability of the ecosystem.

#### Data & Variability
*Instantaneous covariates* represent the state of the environment in close proximity to the animal (i.e., within its direct perceptual range) at the moment it was observed.

*Contemporaneous covariates* represent the state of the environment in a time window (typically days to months).

**This dataset is identified as:**
*Climatological datasets* divide the calendar year into shorter time slices such as days, weeks, months or seasons, and for each slice, apply a summary statistic (e.g., mean, variance, frequency or probability) to many (often at least 10) years of observations made during that slice to estimate the long-term state.
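The difference between a climatological and a contemporaneous covariate can be sketched with pandas. This is a minimal illustration, not the notebook's actual data: the daily sea-surface temperature series `sst` is synthetic, and all variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sea-surface temperature series spanning 10 full years
dates = pd.date_range("2001-01-01", "2010-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
sst = 16.0 + 3.0 * np.sin(2 * np.pi * (day_of_year - 120) / 365.25)
obs = pd.DataFrame({"date": dates, "sst": sst})

year = obs["date"].dt.year.rename("year")
month = obs["date"].dt.month.rename("month")

# Climatological covariate: one summary per calendar month, pooled over all years
climatology = obs.groupby(month)["sst"].agg(["mean", "var"])

# Contemporaneous covariate: the mean for each specific year-month window
contemporaneous = obs.groupby([year, month])["sst"].mean()

print(climatology.shape)     # 12 monthly slices, 2 summary statistics
print(len(contemporaneous))  # one value per year-month
```

An observation made in March 2005 would be matched to the March row of `climatology` in a climatological model, and to the `(2005, 3)` entry of `contemporaneous` otherwise.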

Are contemporaneous covariates necessary, or are climatological covariates sufficient, to model these associations?

We acknowledge that current sampling abilities and requirements lead to imperfect models and potentially biased predictions, and practical recommendations are critically needed for ecologists and managers.

SDMs developed from climatological covariates are relevant for static management and are used to predict important species habitats, with a high potential for delineating marine protected areas and implementing mitigation measures.

#### Scale of the ecological question
Macroscale (Figure 4)
- Animal distribution: fisheries, historical catches, species coverage
- Environmental data: sum of in situ databases, climatological oceanographic covariates
- Temporal resolution: climatological

#### Outcome
Torres et al. (2013) modelled the seasonal distribution of southern right whales (*Eubalaena australis*) and compared the predicted habitat suitability maps with maps of shipping traffic to identify areas of increased collision risk where mitigation measures could be implemented.

[Source](https://onlinelibrary.wiley.com/doi/full/10.1111/ddi.12609)

#### Methods and Techniques (250 words max)
**Process-based modeling**

Global change includes changes in climate, habitat connectivity and nutrient dynamics at various spatial and temporal scales.

The model should allow an exploration of how these changes affect outcomes.

Changing the scale of a process can alter the relative importance of key drivers, or disrupt the process altogether.

Complex simulation models can be process-based, but a highly dimensional model will be difficult to analyze. As the number of estimated parameters increases, the size of the parameter space (i.e., the number of possible combinations of parameter values) increases.
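The growth of the parameter space can be made concrete with a small calculation. This sketch assumes each parameter is explored at the same fixed number of candidate values; the helper `grid_size` and the example levels are hypothetical.

```python
from itertools import product

def grid_size(n_parameters: int, levels_per_parameter: int) -> int:
    """Number of parameter combinations in a full grid."""
    return levels_per_parameter ** n_parameters

# Hypothetical candidate values tried for every parameter
levels = [0.1, 0.5, 0.9]

for n_parameters in (2, 5, 10):
    print(n_parameters, "parameters ->", grid_size(n_parameters, len(levels)), "combinations")

# Enumerating the grid directly confirms the formula for a small case
combinations = list(product(levels, repeat=2))
print(len(combinations))
```

With three levels per parameter, going from 2 to 10 parameters raises the grid from 9 to 59,049 combinations, which is why highly dimensional models quickly become hard to analyze exhaustively.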

#### Social Impact
The protection and restoration of coastal wetlands can be more cost effective than barrier construction as a means to reduce storm damage (Halpern et al. 2007, Costanza et al. 2008, although see Francis et al. 2011).

[Source](https://esajournals.onlinelibrary.wiley.com/doi/10.1890/ES12-00178.1)

[Source](https://daniel-furman.github.io/Python-species-distribution-modeling/)

#### Data (250 words max)
Special attention should be paid to any scaling mismatches, meaning cases where the spatial (or temporal) grain or extent does not match between biodiversity and environmental data or within environmental data. In these cases, we need to make decisions about adequate **upscaling and downscaling strategies.**
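One simple way to reconcile grains is block averaging (upscaling) and nearest-neighbour repetition (downscaling). The sketch below uses NumPy on a synthetic 4 x 4 grid; the function names and the choice of the mean as the aggregation statistic are assumptions, and other strategies (e.g., interpolation) may suit real data better.

```python
import numpy as np

def upscale_block_mean(grid: np.ndarray, factor: int) -> np.ndarray:
    """Coarsen a 2-D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = grid.shape
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    blocks = grid.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))

def downscale_nearest(grid: np.ndarray, factor: int) -> np.ndarray:
    """Refine a 2-D grid by repeating each cell factor times along both axes."""
    return np.repeat(np.repeat(grid, factor, axis=0), factor, axis=1)

# Hypothetical fine-grained environmental layer (values 0..15)
fine = np.arange(16, dtype=float).reshape(4, 4)

coarse = upscale_block_mean(fine, 2)      # 2 x 2 grid of block means
refined = downscale_nearest(coarse, 2)    # back to 4 x 4, but smoothed

print(coarse)
print(refined.shape)
```

Note that the round trip loses within-block variability: `refined` has the original shape but not the original values.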

**Absence data** are rarely available. In such cases, adequate background data or pseudo-absence data need to be selected.
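A common, simple approach is to draw pseudo-absences at random from the study area while keeping a minimum distance from known presences. The sketch below is one such scheme under stated assumptions: planar coordinates, a rectangular study area, and hypothetical names (`sample_pseudo_absences`, `min_distance`). It is not the selection method of any particular SDM package.

```python
import random

def sample_pseudo_absences(presences, bounds, n_points, min_distance,
                           seed=0, max_tries=100_000):
    """Draw random points inside bounds that lie at least min_distance from every presence."""
    rng = random.Random(seed)
    x_min, x_max, y_min, y_max = bounds
    points = []
    for _ in range(max_tries):
        if len(points) == n_points:
            break
        x, y = rng.uniform(x_min, x_max), rng.uniform(y_min, y_max)
        far_enough = all(
            (x - px) ** 2 + (y - py) ** 2 >= min_distance ** 2
            for px, py in presences
        )
        if far_enough:
            points.append((x, y))
    return points

# Hypothetical presence locations in a 10 x 10 study area
presences = [(2.0, 3.0), (2.5, 3.5), (7.0, 8.0)]
pseudo_absences = sample_pseudo_absences(
    presences, bounds=(0.0, 10.0, 0.0, 10.0), n_points=20, min_distance=1.0
)
print(len(pseudo_absences))
```

The `max_tries` cap keeps the loop finite if the exclusion zones cover most of the study area, in which case fewer than `n_points` points are returned.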

For later model assessment, we may wish to partition the data into training data and validation data (Hastie, Tibshirani, and Friedman 2009).

## Import Libraries

In [136]:
#Libraries
import pandas as pd
import numpy as np
import datetime
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path

#Preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Modeling & Metrics
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error

# Data Visualizations
import matplotlib.pyplot as plt



## Load the Dataset

In [131]:
# Species data
df = pd.read_csv('all_species.csv')

In [132]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1949 entries, 0 to 1948
Columns: 158 entries, Datetime to Abundance (ind/m2)
dtypes: float64(153), object(5)
memory usage: 2.3+ MB


In [133]:
df.head()

| | Datetime | Tide | Weather Condition | Water temperature (ºC) | Zone | Supratidal/Middle Intertidal | Substrate | Chthamalus sp. | Balanus perforatus | Patella sp. | ... | Callionymus lira (peixe-pau lira) | Oncidiella celtica | Doriopsilla areolata (nudibrânquio) | Scorpaena sp. (Rascasso) | Lipophrys pholis (ad.) | Diplodus cervinus | Gobiusculus flavescens | Sessile Coverage | Total Mobile Species | Abundance (ind/m2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11/28/2011 10:10:00 | 0.6 | Clear sky | 16.0 | D | Medium | Puddle | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 66.45 | 4.0 | 0.05 |
| 1 | 11/28/2011 10:25:00 | 0.6 | Clear sky | 16.0 | D | Medium | Rock | 8.0 | 0.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 75.1 | 2.0 | 0.1 |
| 2 | 11/28/2011 10:40:00 | 0.6 | Clear sky | 16.0 | D | Medium | Rock | 25.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 67.2 | 1.0 | 0.0 |
| 3 | 11/28/2011 11:00:00 | 0.6 | Clear sky | 16.0 | E | Medium | Rock | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.2 |
| 4 | 11/28/2011 11:15:00 | 0.6 | Clear sky | 16.0 | E | Medium | Rock | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 48.25 | 0.0 | 0.0 |


In [None]:
df[['Tide', 'Water temperature (ºC)', 'Sessile Coverage', 'Total Mobile Species','Abundance (ind/m2)']].describe()

**Initial Observations**
- `Tide` has a nearly equal mean and median, with most values spread within a range of 2 (TODO: How was tide measured?), suggesting a roughly symmetric, possibly normal distribution.
- `Water temperature (ºC)` may have a similar distribution to `Tide`. Are the observed minimum and maximum values of these two features linked to the same event?
- `Sessile Coverage` may need to be plotted to confirm whether the distribution is normal. Is there a time factor, like seasonality?
- `Total Mobile Species` and the related field `Abundance (ind/m2)` have a relatively large range of sample values. Double-check that these values are correlated.

**Note**: The purpose of this evaluation is not only to determine the shape of each distribution: all numeric columns are later standardized by subtracting each feature's mean and scaling it to unit variance with scikit-learn's `StandardScaler`.
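What the standardization step does can be checked on a toy column. The values below are made up for illustration; the only claim is that `StandardScaler` subtracts the column mean and divides by the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical water temperatures (one feature, four samples)
temps = np.array([[14.0], [16.0], [18.0], [20.0]])

scaled = StandardScaler().fit_transform(temps)

# Manual equivalent: subtract the mean, divide by the population standard deviation
manual = (temps - temps.mean(axis=0)) / temps.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

Standardization changes the location and scale of a feature but not the shape of its distribution, so a skewed feature remains skewed afterwards.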

In [141]:
# Calculate the Pearson correlation coefficient and p-value between each feature and abundance
print(pearsonr(df['Tide'], df['Abundance (ind/m2)']))
print(pearsonr(df['Water temperature (ºC)'], df['Abundance (ind/m2)']))
print(pearsonr(df['Sessile Coverage'], df['Abundance (ind/m2)']))

PearsonRResult(statistic=-0.03604420427423636, pvalue=0.1116643146390885)
PearsonRResult(statistic=0.2280404088616998, pvalue=2.075160366851023e-24)
PearsonRResult(statistic=0.02102188575759382, pvalue=0.35363046536318116)


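Water temperature shows a weak but statistically significant positive correlation with mobile species abundance (r ≈ 0.23, p < 0.001). `Tide` and `Sessile Coverage` show no significant linear correlation with abundance (p > 0.05).

Notes on multicollinearity among the predictors:
- Effect: coefficient estimates become sensitive to small changes in the model, and their precision is reduced.
- Do I need to fix it?
  - Severity: VIFs below 5 are generally not severe enough to require correction.
  - Goals: multicollinearity doesn't influence predictions or goodness of fit, so it matters mainly when interpreting individual coefficients.
  - If it is structural, scale (center) the variables.
  - Alternatively, use lasso or ridge regression.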

[Source](https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/)
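The VIF check mentioned in the notes can be sketched with NumPy alone. This uses the identity that the VIF of predictor j, 1 / (1 - R_j²), equals the j-th diagonal entry of the inverse correlation matrix. The data are synthetic and the column names are placeholders, so the numbers say nothing about the real dataset.

```python
import numpy as np

def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """Return one VIF per column of X (rows are observations)."""
    corr = np.corrcoef(X, rowvar=False)
    # VIF_j = 1 / (1 - R_j^2) is the j-th diagonal entry of the inverse correlation matrix
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
n = 500
tide = rng.normal(size=n)
temperature = rng.normal(size=n)
near_copy_of_tide = tide + 0.1 * rng.normal(size=n)  # deliberately collinear with tide

X_demo = np.column_stack([tide, temperature, near_copy_of_tide])
vifs = variance_inflation_factors(X_demo)

for name, vif in zip(["tide", "temperature", "near_copy_of_tide"], vifs):
    flag = "check" if vif >= 5 else "ok"
    print(f"{name}: VIF = {vif:.1f} ({flag})")
```

The independent `temperature` column stays near 1, while the two collinear columns are flagged. On the real data, the same function could be applied to the numeric feature columns after dropping or imputing missing values.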


## Data Preprocessing

### Building a Pipeline

In [None]:
numeric_features = ['Tide', 'Water temperature (ºC)', 'Sessile Coverage']
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), 
           ("scaler", StandardScaler())]
)

categorical_features = ['Weather Condition']
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

### Train/Test split

In [137]:
# Define X and y
X = df[numeric_features + categorical_features]
y = df['Abundance (ind/m2)']

In [147]:
X["Weather Condition"].value_counts()

Clear sky        1385
Cloudy            345
Sunny             103
Rain               83
Fairly Cloudy      33
Name: Weather Condition, dtype: int64

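Only 9 observations were recorded as "Sunny and Windy", so they are collapsed into "Sunny". (The counts above already reflect this replacement, since cell 146 ran before cell 147.)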

In [146]:
X = X.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
X["Weather Condition"] = X["Weather Condition"].replace("Sunny and Windy", "Sunny")


In [148]:
X["Weather Condition"].value_counts()

Clear sky        1385
Cloudy            345
Sunny             103
Rain               83
Fairly Cloudy      33
Name: Weather Condition, dtype: int64

In [None]:
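# Perform the train-test split without shuffling, so the last 20% of rows form the test set
# (random_state is omitted because it has no effect when shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, test_size=0.2)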

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
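An unshuffled split only acts as a temporal hold-out if the rows are in chronological order. Assuming that is the intent here, the sketch below parses the `Datetime` strings, sorts, and then splits. The toy frame mimics the notebook's date format but its values are invented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with out-of-order timestamps in the notebook's month/day/year format
toy = pd.DataFrame({
    "Datetime": [
        "11/28/2011 10:25:00", "11/28/2011 10:10:00", "03/02/2012 09:00:00",
        "01/15/2012 14:30:00", "06/20/2012 08:45:00",
    ],
    "Abundance (ind/m2)": [0.1, 0.05, 0.3, 0.2, 0.0],
})

# Parse the strings and sort so that row order equals time order
toy["Datetime"] = pd.to_datetime(toy["Datetime"], format="%m/%d/%Y %H:%M:%S")
toy = toy.sort_values("Datetime").reset_index(drop=True)

# With shuffle=False the test set is the most recent 20% of observations
train, test = train_test_split(toy, test_size=0.2, shuffle=False)

print(train["Datetime"].max(), "<", test["Datetime"].min())
```

If the survey file is already sorted by time, the sorting step is a no-op and the split in the cell above behaves the same way.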

## Model Selection

#### Linear Regressor

In [None]:
# Create the pipeline with preprocessor and linear regressor
linear_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('linear', LinearRegression())
])

# Fit the pipeline on the data
linear_pipeline.fit(X_train, y_train)

#### Random Forest Regressor

In [None]:
# Create the pipeline with preprocessor and random forest regressor
forest_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('forest', RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 18))
])

# Fit the pipeline on the data
forest_pipeline.fit(X_train, y_train)

### Model Evaluation
Evaluate the trained models using appropriate metrics such as mean squared error (MSE) and mean absolute error (MAE). Compare the performance of different models.

In [None]:
linear_y_preds = linear_pipeline.predict(X_test)
mean_squared_error(y_test, linear_y_preds)

In [None]:
# Predict and score
forest_y_preds = forest_pipeline.predict(X_test)
mean_squared_error(y_test, forest_y_preds)
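Raw MSE values are easier to interpret next to MAE and a naive baseline. The sketch below uses made-up numbers (none come from the notebook's models) and a hypothetical `report` helper; the baseline simply predicts the training-set mean for every test sample.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions, for illustration only
y_train_demo = np.array([0.0, 0.1, 0.2, 0.3])
y_test_demo = np.array([0.0, 0.1, 0.2, 0.5])
model_preds = np.array([0.1, 0.1, 0.3, 0.3])

# Naive baseline: always predict the mean of the training targets
baseline_preds = np.full_like(y_test_demo, y_train_demo.mean())

def report(name, y_true, y_pred):
    """Print and return MSE and MAE for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    print(f"{name}: MSE={mse:.4f}  RMSE={np.sqrt(mse):.4f}  MAE={mae:.4f}")
    return mse, mae

model_mse, model_mae = report("model", y_test_demo, model_preds)
baseline_mse, baseline_mae = report("baseline", y_test_demo, baseline_preds)
```

In the notebook, `y_test`, `linear_y_preds` and `forest_y_preds` would take the place of the toy arrays. A model that cannot beat the mean baseline is not yet capturing useful structure.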

### Feature Importances

In [None]:
import eli5

In [None]:
# Extract encoded feature names and append them to the known list of numerical features
onehot_columns = list(forest_pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['encoder'].get_feature_names_out(input_features=categorical_features))
numeric_features_list = list(numeric_features)
numeric_features_list.extend(onehot_columns)

In [None]:
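# Note: target_names is only meaningful for classifiers and is not used below for this regression target
target_names = y_test.unique().astype(str)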

In [None]:
eli5.explain_weights(forest_pipeline.named_steps['forest'], top=3, feature_names=numeric_features_list)

[Source](https://towardsdatascience.com/extracting-feature-importances-from-scikit-learn-pipelines-18c79b4ae09a)
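As an alternative that needs no extra package, the `permutation_importance` function already imported above can score the original input columns of a whole pipeline. The sketch below runs on synthetic data with invented column names (`temp`, `tide`, `weather`), where the target depends only on `temp` by construction.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "temp": rng.normal(16, 2, n),
    "tide": rng.normal(0.6, 0.2, n),
    "weather": rng.choice(["Clear sky", "Cloudy", "Rain"], n),
})
# Synthetic target driven by temperature only, plus noise
y_demo = 0.5 * X_demo["temp"] + rng.normal(0, 0.1, n)

pipe = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["temp", "tide"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["weather"]),
    ])),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipe.fit(X_demo, y_demo)

# Permuting a raw input column measures its importance through the full pipeline
result = permutation_importance(pipe, X_demo, y_demo, n_repeats=5, random_state=0)
importances = pd.Series(result.importances_mean, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Because the permutation happens before preprocessing, each categorical feature gets a single score rather than one per one-hot column. For an honest estimate, the scores would normally be computed on held-out data such as `X_test` and `y_test`.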

### Final Model and Predictions
Select the best-performing model based on evaluation metrics, retrain it on the full training data, and make predictions on the test set. Evaluate the final model's performance and interpret the results.

### Conclusion
Summarize your findings, discuss any insights gained from the analysis, and suggest future steps for improvement if applicable.