# World Data League 2022

## Notebook Submission Template

This notebook is one of the mandatory deliverables when you submit your solution. Its structure follows the WDL evaluation criteria and it has dedicated cells where you should add information. Make sure your code is readable as it will be the only technical support the jury will have to evaluate your work. Make sure to list all the datasets used besides the ones provided.

Instructions:
1. 🧱 Create a separate copy of this template and **do not change** the predefined structure
2. 👥 Fill in the Authors section with the name of each team member
3. 💻 Develop your code - make sure to add comments and save all the output you want the jury to see. Your code **must be** runnable!
4. 📄 Fill in all the text sections
5. 🗑️ Remove this section (‘Notebook Submission Template’) and any instructions inside other sections
6. 📥 Export as HTML and make sure all the visualisations are visible.
7. ⬆️ Upload the .ipynb file to the submission platform and make sure that all the visualisations are visible and everything (text, images, ..) in all deliverables renders correctly.


## 🎯 Challenge
*Insert challenge name here*


## Team: (Insert Team Name Here)
## 👥 Authors
* Person 1
* Person 2
* Person 3

## 💻 Development
Start coding here! 🐱‍🏍

Create the necessary subsections (e.g. EDA, different experiments, etc.) and markdown cells to describe your work where you see fit. Comment your code.

All new subsections must start with three hash characters. More specifically, don't forget to explore the following:
1. Assess the data quality
2. Make sure you have a good EDA where you list all the insights
3. Explain the process for feature engineering and cleaning
4. Discuss the model / technique(s) selection
5. Don't forget to explore model interpretability and fairness or justify why it is not needed

Pro-tip 1: Don't forget to make the jury's life easier. Remove any unnecessary prints before submitting the work. Hide any long output cells (from training a model for example). For each subsection, have a quick introduction (justifying what you are about to do) and conclusion (results you got from what you did). 

Pro-tip 2: Have many similar graphs which all tell the same story? Add them to the appendix and show only a couple of examples, mentioning that all the others are in the appendix.

Pro-tip 3: Don't forget to motivate all of your choices. These can be data-driven, constraints-driven, literature-driven or any combination. For example, explain why you chose to test certain algorithms, or why you tested only one.

In [None]:
#Libraries
import pandas as pd
import numpy as np
import datetime
import seaborn as sns
from scipy import stats as st
import matplotlib.pyplot as plt
import plotly.express as px

#Preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Modeling & Metrics
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

**Load the Dataset**

In [None]:
# Species data
df = pd.read_csv('../data/all_species.csv')

**Data Preprocessing**

In [None]:
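df['Datetime'] = pd.to_datetime(df['Datetime'])
# Derive the calendar features listed in datetime_features (assumes the CSV may not provide them)
df['Month'], df['Year'] = df['Datetime'].dt.month, df['Datetime'].dt.year
df["Weather Condition"] = df["Weather Condition"].replace("Sunny and Windy", "Sunny")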

In [None]:
numeric_features = ['Tide', 'Water temperature (ºC)', 'Sessile Coverage']
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), 
           ("scaler", StandardScaler())]
)

datetime_features = ['Month', 'Year']
categorical_features = ['Weather Condition', 'Zone']
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features + datetime_features),
    ]
)

**Temporal Split for training and validation data**

In [2]:
# Adapted from https://www.rasgoml.com/feature-engineering-tutorials/scikit-learn-time-series-split

min_date = df.Datetime.iloc[0] #2011-11-28
max_date = df.Datetime.iloc[-1] #2020-11-16

# Cutoff Date: 2016-05-23
train_percent = .5
time_between = max_date - min_date
train_cutoff = min_date + train_percent*time_between

# Split at the cutoff date and index by date for referencing
train_df = df[df.Datetime <= train_cutoff].set_index('Datetime')
test_df = df[df.Datetime > train_cutoff].set_index('Datetime')

# Set X and y
X_train = train_df[numeric_features + categorical_features + datetime_features] #(1318, 7)
y_train = train_df['Abundance (ind/m2)'] #(1318,)
X_test = test_df[numeric_features + categorical_features + datetime_features] #(561, 7)
y_test = test_df['Abundance (ind/m2)'] #(561,)


**Random Forest Regressor** 

In [None]:
# Create the pipeline with preprocessor and random forest regressor
forest_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('forest', RandomForestRegressor(random_state = 0))
])

# Fit the pipeline on the data
forest_pipeline.fit(X_train, y_train)

# Predict and score
forest_y_preds = forest_pipeline.predict(X_test)
mse = mean_squared_error(y_test, forest_y_preds)
mse
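A raw MSE is hard to judge on its own. One possible way to give it context, sketched below on made-up numbers, is to report the root mean squared error (same units as the abundance target) next to a naive baseline that always predicts the training mean. The arrays and the `rmse` helper are hypothetical and are not part of the original analysis.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Hypothetical abundance values, for illustration only
y_train_toy = np.array([10.0, 12.0, 8.0, 10.0])
y_test_toy = np.array([9.0, 11.0])
model_preds_toy = np.array([9.5, 10.5])

# Baseline that ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(np.zeros((4, 1)), y_train_toy)
baseline_preds = baseline.predict(np.zeros((2, 1)))

model_rmse = rmse(y_test_toy, model_preds_toy)
baseline_rmse = rmse(y_test_toy, baseline_preds)
print(f"model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

In the notebook, the same comparison would use `y_train`, `y_test` and `forest_y_preds`; a model whose RMSE is not clearly below the baseline adds little over predicting the average.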

## 🖼️ Visualisations
Copy here the most important visualizations (graphs, charts, maps, images, etc). You can refer to them in the Executive Summary.

Technical note: If not all the visualisations are visible, you can still include them as an image or link - in this case please upload them to your own repository.

**Feature Importances**

In [None]:
# Compute importances and standard deviations
feature_names = forest_pipeline.named_steps['preprocessor'].get_feature_names_out()
importances = forest_pipeline.named_steps['forest'].feature_importances_
std = np.std([tree.feature_importances_ for tree in forest_pipeline.named_steps['forest'].estimators_], axis=0)

# Load importances into a frame 
forest_importances = pd.Series(importances, index=feature_names)
forest_importances = forest_importances.sort_values(ascending=False)

# Plot top 5 importances
fig = px.bar(forest_importances[:5], title="Feature importances using MDI")
fig.show()
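The cell above computes `std`, the spread of each importance across trees, but does not use it. A minimal sketch of how importances and their spread could be kept together is shown below; `importance_frame` and the synthetic data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def importance_frame(forest, feature_names, top_n=5):
    """Return the top_n MDI importances with their standard deviation across trees."""
    per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
    frame = pd.DataFrame({
        "feature": list(feature_names),
        "importance": forest.feature_importances_,
        "std": per_tree.std(axis=0),
    })
    return (frame.sort_values("importance", ascending=False)
                 .head(top_n)
                 .reset_index(drop=True))


# Synthetic data: only the first column drives the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

toy_forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)
toy_importances = importance_frame(toy_forest, ["signal", "noise_a", "noise_b"])
print(toy_importances)
```

With the notebook's objects, the frame could be built from `forest_pipeline.named_steps['forest']` and `feature_names`, then plotted with error bars through `px.bar(frame, x="feature", y="importance", error_y="std")`.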

**Compare actual values to predicted values**

In [None]:
# Create frame for results
actual_values = pd.DataFrame(y_test).reset_index()
# Average only the target column, so the Datetime column is not aggregated as well
actual_values = actual_values.groupby(actual_values['Datetime'].dt.year)[['Abundance (ind/m2)']].mean()

fitted_values = pd.DataFrame(forest_y_preds, columns=['Abundance (ind/m2)'], index=y_test.index).reset_index()
fitted_values['Datetime'] = fitted_values['Datetime'].astype('datetime64[ns]')
fitted_values = fitted_values.groupby(fitted_values['Datetime'].dt.year)[['Abundance (ind/m2)']].mean()


# Create identifiers for merging
fitted_values['Type'] = "Fitted Value"
actual_values['Type'] = "Actual Value"

# Merge frames
results = pd.concat([fitted_values, actual_values])
results.reset_index(inplace=True)

# Plot results
px.line(results, x='Datetime', y='Abundance (ind/m2)', color='Type')

## 👓 References
List all of the external links (even if they are already linked above), such as external datasets, papers, blog posts, code repositories and any other materials.

## ⏭️ Appendix
Add here any code, images or text that you still find relevant, but that was too long to include in the main report. This section is optional.


**Understanding covariates**

- Tide shows no cyclical or seasonal pattern in the graph.
    - Definition: the height of the low tide, in meters. For more information see: https://oceanservice.noaa.gov/education/tutorial_tides/tides01_intro.html
- Sea temperature has a seasonal pattern. Is it stationary? An Augmented Dickey-Fuller (ADF) test could check this.
- Sessile Coverage needs more exploration of its pattern.
    - Definition: total percentage of the sample covered by sessile species.
- Total / Abundance dropped off at the end of 2016. What happened?
    - Definition: how many individuals of the studied mobile species were found in the sample, per m2.

In [None]:
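# Assumption: `variables` is not defined earlier in the notebook; these are the covariates discussed above
variables = ['Tide', 'Water temperature (ºC)', 'Sessile Coverage', 'Abundance (ind/m2)']
df_dt = df.set_index('Datetime')
df_dt[variables].plot(subplots=True, figsize=(15, 15), title=variables);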

**Invasive Species Over Time**
- Asparagopsis armata increases in coverage after the drop in abundance in 2016. Is this related?
- Cladophora sp. is the most aggressive invasive species.
    - "Where Cladophora becomes a pest is generally where special circumstances cause such drastic overgrowth that algal blooms develop and form floating mats. Typical examples include where hypertrophication or high mortality of rival organisms produce high concentrations of dissolved phosphorus. Extensive floating mats prevent circulation that is necessary for the aeration of deeper water and, by blocking the light, they kill photosynthesising organisms growing beneath. The mats interfere with the fishing industry by clogging nets and preventing the use of lines. Where they wash ashore the masses of rotting material reduce shoreline property values along water bodies such as the Great Lakes in the United States."

In [None]:
invasive = ['Asparagopsis armata (tufosa)',
            'Asparagopsis armata (adulta)', 
            'Osmundea pinnatifida',
            'Cladophora sp. (limo)',
            'Codium sp. (alga verde carnuda)',
            'Colpomenia sinuosa (alga bolhas)']

df_dt[invasive].plot(subplots=True,figsize=(15,15),title=invasive);

In [None]:
# Mean abundance per weather condition, then mean covariates per Cladophora coverage level
df_dt.groupby(by='Weather Condition')['Abundance (ind/m2)'].mean().sort_values().plot()
df_dt.groupby(by='Cladophora sp. (limo)')[variables].mean().plot()

In [None]:
# Most frequent value in a group (pandas-based, so it also handles text columns);
# returns NaN for months without samples
def most_common(x):
    modes = x.mode()
    return modes.iloc[0] if not modes.empty else np.nan

month_group = df_dt.groupby(pd.Grouper(freq='M'))
month_ts = month_group.agg({'Tide': 'mean',
      'Weather Condition': most_common,
      'Water temperature (ºC)': 'mean',
      'Supratidal/Middle Intertidal': most_common,
      'Substrate': most_common,
      'Abundance (ind/m2)': 'mean'})

month_ts.plot()