In this notebook, I will...
   - calculate baseline error
   - run random forest regressors on abundance 
   - load regional dfs, impute, add island col as ID, rbind all 5 dfs 
   - predict out for each group 

#### Scoring models

We evaluated the performance of the abundance models using mean absolute error (MAE) and mean absolute percent error (MAPE).
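
As a minimal sketch of how these two metrics are computed, the plain-Python functions below work on small lists. The function names and the percent-cover values are hypothetical and for illustration only. Skipping observed values of 0 in the MAPE is an assumption, since the percent error is undefined there.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mean_absolute_percent_error(observed, predicted):
    """Average absolute error, expressed as a percentage of the observed value.

    Pairs with an observed value of 0 are skipped (an assumption made for
    this sketch), because the percent error is undefined for them.
    """
    ratios = [abs(o - p) / abs(o) for o, p in zip(observed, predicted) if o != 0]
    return 100 * sum(ratios) / len(ratios)


# Hypothetical percent-cover values, not taken from the coral data
observed = [10.0, 20.0, 40.0]
predicted = [12.0, 15.0, 40.0]

mae = mean_absolute_error(observed, predicted)
mape = mean_absolute_percent_error(observed, predicted)
print(f"MAE = {mae:.2f}, MAPE = {mape:.1f}%")
```

MAE is reported in the units of the target, whereas MAPE is relative to each observed value, so the two can rank models differently when many observations are small.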

## Import modules

In [44]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_curve, confusion_matrix, auc
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn import *
from sklearn.metrics import r2_score
from sklearn.metrics import classification_report
import pickle
import requests
from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

In [14]:
os.chdir('C:/Users/linds/OneDrive/Documents/samoa_corals_data')

## Import data

In [15]:
coral_types = ['scler', 'branching', 'columnar', 'encrusting', 'free_livin', 'massive', 'plate']
target_types = ['binary', 'percent']
df = {}

# Load one CSV per coral type / target type combination
for coral in coral_types:
    for target in target_types:
        key = f'{coral}_{target}'
        df[key] = pd.read_csv(f'{key}.csv')
        del df[key]['Unnamed: 0']  # artifact indexing column
# Access the data as, e.g., df['scler_binary']

## Train/test split, basic GLM

##### Scler (all corals)

In [37]:
X_train, X_test, y_train, y_test = train_test_split(df['scler_binary'].drop(['Sclr_Cr','lat','lon','ID'], axis=1), 
                                                    df['scler_binary']['Sclr_Cr'], 
                                                    test_size = 0.3, random_state = 30)


print('Scler Training Features Shape:', X_train.shape)
print('Scler Training Labels Shape:', y_train.shape)
print('Scler Testing Features Shape:', X_test.shape)
print('Scler Testing Labels Shape:', y_test.shape)

# Create an instance of LogisticRegression
logmodel = LogisticRegression()
# Fit the GLM
logmodel.fit(X_train, y_train)
predictions=logmodel.predict(X_test)
print(classification_report(y_test, predictions))

Scler Training Features Shape: (1372, 9)
Scler Training Labels Shape: (1372,)
Scler Testing Features Shape: (588, 9)
Scler Testing Labels Shape: (588,)
Average baseline error:  0.2
              precision    recall  f1-score   support

           0       0.82      1.00      0.90       480
           1       0.00      0.00      0.00       108

    accuracy                           0.81       588
   macro avg       0.41      0.50      0.45       588
weighted avg       0.67      0.81      0.73       588



We used the all-coral (scleractinian) dataset to test whether a basic generalized linear model (GLM; here, a logistic regression) could predict coral occurrence. The GLM substantially underestimated occurrence: corals were present in about 20% of the original training data, but in <1% of the predictions, and the model achieved a recall of 0.00 for the presence class. In light of these poor results, we proceeded to parameterize a random forest classifier.
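
The comparison of observed and predicted occurrence can be sketched as below. The `occurrence_rate` helper and the label lists are hypothetical stand-ins, not the coral data. In the notebook itself, `y_train.mean()` and `predictions.mean()` would give the same two quantities, because the labels are coded 0/1.

```python
def occurrence_rate(labels):
    """Fraction of samples labelled 1 (coral present)."""
    return sum(labels) / len(labels)


# Hypothetical 0/1 labels: presence in 2 of 10 training samples
observed_labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

# Hypothetical predictions from a model that never predicts presence
predicted_labels = [0] * 10

observed_rate = occurrence_rate(observed_labels)
predicted_rate = occurrence_rate(predicted_labels)

print(f"Observed occurrence:  {observed_rate:.0%}")
print(f"Predicted occurrence: {predicted_rate:.0%}")
```

A model can reach high overall accuracy this way while missing every presence, which is why the per-class recall is more informative here than accuracy alone.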

## Train random forest classifier

In [45]:
# Instantiate model with 1000 decision trees
rf = RandomForestClassifier(n_estimators = 1000, random_state = 30)

# Train the model on training data
rf.fit(X_train, y_train)

# Predicting to the test data
predictions = rf.predict(X_test)

# Check precision, F1 score
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.86      0.96      0.91       480
           1       0.62      0.31      0.41       108

    accuracy                           0.84       588
   macro avg       0.74      0.63      0.66       588
weighted avg       0.82      0.84      0.82       588



Our random forest binary classifier improved upon the poor performance of the GLM (macro-averaged precision = 0.74 and macro-averaged F1 score = 0.66, versus 0.41 and 0.45 for the GLM). We then attempted to improve on these scores by fitting a random forest regressor to the same binary data, which produces continuous predictions in the 0-1 range rather than binary 0 or 1 predictions. We converted those predictions to classes using a range of thresholds, looking for one that maximized both precision and F1 score, but were unable to improve on the classifier's performance on either metric.
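
The two scores quoted above are the "macro avg" row of the classification report, which is the unweighted mean of the per-class values. A quick check using the rounded per-class numbers from the report (the variable names are only for illustration):

```python
# Per-class values copied from the random forest classification report
precision_by_class = {0: 0.86, 1: 0.62}
f1_by_class = {0: 0.91, 1: 0.41}

# Macro average: unweighted mean across classes, ignoring class support
macro_precision = sum(precision_by_class.values()) / len(precision_by_class)
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

print(f"macro precision = {macro_precision:.2f}")
print(f"macro F1        = {macro_f1:.2f}")
```

Because the macro average weights both classes equally, the weak scores for the rarer presence class pull it well below the weighted average in the same report.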

## Train random forest regressor on binary data, interpret results based on a selected threshold 

In [57]:
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 30)

# Train the model on training data
rf.fit(X_train, y_train)

In [86]:
# Predicting to the test data
predictions = rf.predict(X_test)

# Convert continuous predictions to classes using a threshold
# (0.52 was chosen by trial and error)
predictions = (predictions >= 0.52).astype(int)

# Check precision, F1 score
print(classification_report(y_test, predictions))

# Scores to beat (classifier, macro avg): precision = 0.74, F1 = 0.66
# Result: the thresholded regressor (macro avg precision = 0.72, F1 = 0.64) does not beat them

              precision    recall  f1-score   support

           0       0.85      0.96      0.90       480
           1       0.59      0.28      0.38       108

    accuracy                           0.83       588
   macro avg       0.72      0.62      0.64       588
weighted avg       0.81      0.83      0.81       588
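
The trial-and-error threshold search could be made systematic with a sweep like the one sketched below. This is a plain-Python illustration under stated assumptions, not the procedure used in this notebook: the helper functions, the labels, the scores, and the candidate thresholds are all hypothetical, and ranking by macro F1 first with macro precision as the tie-breaker is one possible way to combine the two metrics.

```python
def binarize(scores, threshold):
    """Label a sample 1 when its score is at or above the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def precision_f1(y_true, y_pred, positive):
    """Precision and F1 for one class, treating `positive` as the target."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, f1


def macro_scores(y_true, y_pred):
    """Unweighted mean of per-class precision and F1 over classes 0 and 1."""
    per_class = [precision_f1(y_true, y_pred, c) for c in (0, 1)]
    macro_precision = sum(p for p, _ in per_class) / len(per_class)
    macro_f1 = sum(f for _, f in per_class) / len(per_class)
    return macro_precision, macro_f1


# Hypothetical labels and regressor outputs, not the coral data
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.55, 0.90]

thresholds = [0.3, 0.4, 0.5, 0.6]
results = {t: macro_scores(y_true, binarize(scores, t)) for t in thresholds}

# Rank by macro F1, then macro precision (an assumed tie-breaking rule)
best = max(results, key=lambda t: (results[t][1], results[t][0]))

for t in thresholds:
    precision, f1 = results[t]
    print(f"threshold {t:.2f}: macro precision = {precision:.3f}, macro F1 = {f1:.3f}")
print("best threshold:", best)
```

Choosing the threshold on the test set, as this sketch does for brevity, makes the reported scores optimistic; a separate validation split would give a fairer comparison with the classifier.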

