In this notebook, I will:
   - calculate the baseline error
   - run random forest regressors on abundance
   - load the regional dataframes, impute missing values, add an island column as an ID, and row-bind all 5 dataframes
   - predict out for each group

#### Scoring models

We evaluated the performance of the abundance models using mean absolute error (MAE) and mean absolute percent error (MAPE).
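
As a rough illustration of the two scores, here is a minimal sketch. The function names `mae_score` and `mape_score` and the cover values are made up for this example, and skipping cells with an observed value of zero in the MAPE is an assumption, not necessarily how the scores were computed for these models.

```python
import numpy as np

def mae_score(y_true, y_pred):
    """Mean absolute error, in the units of the target (here, percent cover)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_pred - y_true))

def mape_score(y_true, y_pred):
    """Mean absolute percent error.

    Assumption: cells with an observed value of 0 are skipped, because the
    percentage error is undefined there.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = y_true != 0
    return 100 * np.mean(np.abs((y_pred[nonzero] - y_true[nonzero]) / y_true[nonzero]))

# Made-up observed and predicted cover values, for illustration only
observed = np.array([10.0, 20.0, 0.0, 40.0])
predicted = np.array([12.0, 18.0, 1.0, 30.0])

mae = mae_score(observed, predicted)
mape = mape_score(observed, predicted)
print('MAE:', round(mae, 2))
print('MAPE (%):', round(mape, 2))
```

MAE stays in the units of the target, while MAPE is relative to the observed value, so it can become very large for coral types whose cover is close to zero.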

## Import modules

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_curve, confusion_matrix, auc
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn import *
from sklearn.metrics import r2_score
from sklearn.metrics import classification_report
import pickle
import requests
from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

In [2]:
os.chdir('C:/Users/linds/OneDrive/Documents/samoa_corals_data')

## Import data

In [3]:
coral_types=['scler','branching','columnar','encrusting','free_livin','massive','plate']
target_types=['binary','percent']
df = dict()

for coral in coral_types:
    for target in target_types:
        key = f'{coral}_{target}'
        # Drop 'Unnamed: 0', an artifact indexing column
        df[key] = pd.read_csv(f'{key}.csv').drop(columns='Unnamed: 0')
# Access the data as, e.g., df['scler_percent']

# Scler (all corals)

### Train/test split, baseline error

The baseline is the error I would get if I simply predicted the average abundance for every cell. If the model's error is lower than this baseline, the model adds predictive value beyond the mean.

In [42]:
X_train, X_test, y_train, y_test = train_test_split(df['scler_percent'].drop(['Sclr_Rw','lat','lon','ID'], axis=1), 
                                                    df['scler_percent']['Sclr_Rw'], 
                                                    test_size = 0.3, random_state = 30)


print('Scler Training Features Shape:', X_train.shape)
print('Scler Training Labels Shape:', y_train.shape)
print('Scler Testing Features Shape:', X_test.shape)
print('Scler Testing Labels Shape:', y_test.shape)

# Baseline: predict the training-set mean abundance for every cell
# (note: the error below is measured against the training labels)

baseline_preds = np.array([y_train.mean()] * len(y_train))
baseline_errors = abs(baseline_preds - y_train)
print('Baseline prediction error: ', round(np.mean(baseline_errors), 2))

Scler Training Features Shape: (1372, 9)
Scler Training Labels Shape: (1372,)
Scler Testing Features Shape: (588, 9)
Scler Testing Labels Shape: (588,)
Baseline prediction error:  14.31
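
The baseline above is scored against the training labels, while the model is later scored against the test labels. A minimal sketch of scoring both on the same held-out set is shown here, using scikit-learn's `DummyRegressor`. The data are synthetic stand-ins (random features and cover values), not the coral data, so the numbers it prints say nothing about the real models.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the 9 predictors and the percent-cover target
rng = np.random.default_rng(30)
X = rng.normal(size=(200, 9))
y = rng.uniform(0, 60, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=30)

# DummyRegressor(strategy='mean') always predicts the training-set mean
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
baseline_mae_test = mean_absolute_error(y_test, baseline.predict(X_test))

# The same quantity computed by hand, as a cross-check
manual_mae_test = np.mean(np.abs(y_train.mean() - y_test))

print('Baseline MAE on the test set:', round(baseline_mae_test, 2))
```

Scoring the baseline on `y_test` makes it directly comparable with the model's test-set MAE.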


### Train random forest regressor

In [43]:
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 30)

# Train the model on training data
rf.fit(X_train, y_train)

# Predict on the test data
predictions = rf.predict(X_test)

# Calculate the absolute errors
errors = abs(predictions - y_test)

# Print out the mean absolute error (mae)
print('Mean Absolute Error: ', round(np.mean(errors), 2))

Mean Absolute Error:  10.22


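Using the random forest regressor reduced the mean absolute error from 14.31 to 10.22, nearly 30% lower than the "baseline average" prediction (i.e., predicting the same mean coral abundance for every sample site). Note that the baseline error was computed on the training labels and the model error on the test labels, so the comparison is approximate.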
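
The relative reduction can be checked directly from the two printed values. This is only the arithmetic; the variable names are chosen for this example.

```python
baseline_mae = 14.31  # printed baseline prediction error
model_mae = 10.22     # printed random forest mean absolute error

# Share of the baseline error removed by the model
relative_reduction = (baseline_mae - model_mae) / baseline_mae
print(f'Relative reduction in error: {relative_reduction:.1%}')
```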

### Pickle the model

In [51]:
with open('scler_abundance_model.pkl', 'wb') as fid:
    pickle.dump(rf, fid, 2)
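
For reference, a pickled model can be read back with `pickle.load` and used for prediction. The sketch below is a self-contained round trip on a small synthetic model written to a temporary directory; the file name `example_model.pkl` and the data are hypothetical. A pickled scikit-learn model is generally only safe to load with the same scikit-learn version that wrote it.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model, standing in for the fitted abundance model
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 9))
y = rng.uniform(0, 60, size=50)
model = RandomForestRegressor(n_estimators=10, random_state=30).fit(X, y)

# Hypothetical file name, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'example_model.pkl')
with open(path, 'wb') as fid:
    pickle.dump(model, fid, 2)

# Load the model back and confirm it predicts the same values
with open(path, 'rb') as fid:
    restored = pickle.load(fid)

same_predictions = np.allclose(model.predict(X), restored.predict(X))
print('Restored model reproduces predictions:', same_predictions)
```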

# Branching

In [57]:
X_train, X_test, y_train, y_test = train_test_split(df['branching_percent'].drop(['Brnch_R','lat','lon','ID'], axis=1), 
                                                    df['branching_percent']['Brnch_R'], 
                                                    test_size = 0.3, random_state = 99)


print('Branching Training Features Shape:', X_train.shape)
print('Branching Training Labels Shape:', y_train.shape)
print('Branching Testing Features Shape:', X_test.shape)
print('Branching Testing Labels Shape:', y_test.shape)

baseline_preds = np.array([y_train.mean()] * len(y_train))
baseline_errors = abs(baseline_preds - y_train)
print('Baseline prediction error: ', round(np.mean(baseline_errors), 2))

rf = RandomForestRegressor(n_estimators = 1000, random_state = 99)
rf.fit(X_train, y_train)

predictions = rf.predict(X_test)
errors = abs(predictions - y_test)

print('Mean Absolute Error: ', round(np.mean(errors), 2))

Branching Training Features Shape: (2214, 9)
Branching Training Labels Shape: (2214,)
Branching Testing Features Shape: (949, 9)
Branching Testing Labels Shape: (949,)
Baseline prediction error:  2.54
Mean Absolute Error:  2.18


# Columnar

In [60]:
X_train, X_test, y_train, y_test = train_test_split(df['columnar_percent'].drop(['Clmnr_R','lat','lon','ID'], axis=1), 
                                                    df['columnar_percent']['Clmnr_R'], 
                                                    test_size = 0.3, random_state = 120)


print('Columnar Training Features Shape:', X_train.shape)
print('Columnar Training Labels Shape:', y_train.shape)
print('Columnar Testing Features Shape:', X_test.shape)
print('Columnar Testing Labels Shape:', y_test.shape)

baseline_preds = np.array([y_train.mean()] * len(y_train))
baseline_errors = abs(baseline_preds - y_train)
print('Baseline prediction error: ', round(np.mean(baseline_errors), 2))

rf = RandomForestRegressor(n_estimators = 1000, random_state = 99)
rf.fit(X_train, y_train)

predictions = rf.predict(X_test)
errors = abs(predictions - y_test)

print('Mean Absolute Error: ', round(np.mean(errors), 2))

Columnar Training Features Shape: (2214, 9)
Columnar Training Labels Shape: (2214,)
Columnar Testing Features Shape: (949, 9)
Columnar Testing Labels Shape: (949,)
Baseline prediction error:  0.2
Mean Absolute Error:  0.1
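
The split, baseline, and fit steps are repeated for each coral type with only the target column changing, so they could be wrapped in one helper. The sketch below is one possible way to do that, not the code used to produce the results above. The function name `fit_abundance_model`, the demo column names (`depth`, `slope`, `Cover`), and the synthetic data are all hypothetical, and the helper uses a single seed for both the split and the forest, which differs from the Columnar cell (120 for the split, 99 for the forest).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def fit_abundance_model(data, target, drop_cols=('lat', 'lon', 'ID'),
                        seed=30, n_estimators=1000):
    """Split the data, compute the mean-prediction baseline, and fit a forest.

    Returns the fitted model, the baseline MAE (on the training labels, as in
    the cells above), and the model MAE on the test labels.
    """
    X = data.drop(columns=[target, *drop_cols])
    y = data[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)

    # Baseline: predict the training mean for every cell
    baseline_mae = np.mean(np.abs(y_train.mean() - y_train))

    rf = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
    rf.fit(X_train, y_train)
    model_mae = np.mean(np.abs(rf.predict(X_test) - y_test))
    return rf, baseline_mae, model_mae

# Synthetic demo data with hypothetical column names
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(100, 4)),
                    columns=['depth', 'slope', 'lat', 'lon'])
demo['ID'] = np.arange(100)
demo['Cover'] = rng.uniform(0, 60, size=100)

rf_demo, baseline_mae, model_mae = fit_abundance_model(
    demo, 'Cover', n_estimators=20)
print('Baseline prediction error:', round(baseline_mae, 2))
print('Mean Absolute Error:', round(model_mae, 2))
```

With the real data, a call might look like `fit_abundance_model(df['columnar_percent'], 'Clmnr_R', seed=99)`, which also avoids mislabeled print statements copied from one section to the next.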


# Encrusting