In this notebook, I will...
   - [X] run a basic GLM 
   - [X] run random forest classifier on binary scler data
   - [X] evaluate performance based on key metrics
   - load regional dfs, impute, add island col as ID, rbind all 5 dfs 
   - predict out for each group 
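
The remaining to-do item (load the regional tables, impute, tag each with an island ID, and row-bind them) might look roughly like the sketch below. The island names, column names, and the use of median imputation are all assumptions made for illustration; the real tables would be read from file.

```python
import numpy as np
import pandas as pd

# Hypothetical regional tables; the real ones would come from pd.read_csv
regional = {
    'island_a': pd.DataFrame({'depth': [3.0, np.nan, 8.0], 'slope': [0.1, 0.4, 0.2]}),
    'island_b': pd.DataFrame({'depth': [12.0, 5.0], 'slope': [np.nan, 0.3]}),
}

frames = []
for island, frame in regional.items():
    # Median imputation is an assumed choice, applied within each region
    imputed = frame.fillna(frame.median(numeric_only=True))
    frames.append(imputed.assign(ID=island))  # tag every row with its island

# Row-bind all regions into one table (the pandas analogue of R's rbind)
all_regions = pd.concat(frames, ignore_index=True)
print(all_regions)
```

Imputing within each region before concatenating keeps one island's values from leaking into another's; whether that is the right choice depends on how the regions differ.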

#### Scoring models

To score the classifier models, we used precision, the proportion of predicted positives that are truly positive. Because precision penalizes false positives, a high score improves our confidence that the regions of the map identified as likely to host corals do, in fact, host corals. We also scored the models using the F1 score, the harmonic mean of precision and recall, which balances our priorities of correctly identifying locations of corals (precision) and finding as many corals in unsurveyed regions as possible (recall).
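
As a small illustration of these metrics, the sketch below scores a made-up set of ten labels (not the coral data) with scikit-learn's `precision_score`, `recall_score`, and `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = coral present, 0 = coral absent
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# 2 true positives, 1 false positive, 2 false negatives
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 4 / 7

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

Note that `classification_report` prints these values per class and also as macro and weighted averages.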

## Import modules

In [89]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_curve, confusion_matrix, auc
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn import *
from sklearn.metrics import r2_score
from sklearn.metrics import classification_report
import pickle
import requests
from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

In [90]:
os.chdir('C:/Users/linds/OneDrive/Documents/samoa_corals_data')

## Import data

In [91]:
coral_types=['scler','branching','columnar','encrusting','free_livin','massive','plate']
target_types=['binary','percent']
df = dict()

for coral in coral_types:
    for target in target_types:
        key = f'{coral}_{target}'
        # Read e.g. scler_binary.csv and drop the artifact indexing column
        df[key] = pd.read_csv(f'{key}.csv').drop(columns='Unnamed: 0')
# Access the data as, e.g., df['scler_binary']

## Train/test split, basic GLM

##### Scler (all corals)

In [92]:
X_train, X_test, y_train, y_test = train_test_split(df['scler_binary'].drop(['Sclr_Cr','lat','lon','ID'], axis=1), 
                                                    df['scler_binary']['Sclr_Cr'], 
                                                    test_size = 0.3, random_state = 30)


print('Scler Training Features Shape:', X_train.shape)
print('Scler Training Labels Shape:', y_train.shape)
print('Scler Testing Features Shape:', X_test.shape)
print('Scler Testing Labels Shape:', y_test.shape)

# Create an instance of LogisticRegression
logmodel = LogisticRegression()
# Fit the GLM
logmodel.fit(X_train, y_train)
predictions=logmodel.predict(X_test)
print(classification_report(y_test, predictions))

Scler Training Features Shape: (1372, 9)
Scler Training Labels Shape: (1372,)
Scler Testing Features Shape: (588, 9)
Scler Testing Labels Shape: (588,)
              precision    recall  f1-score   support

           0       0.82      1.00      0.90       480
           1       0.00      0.00      0.00       108

    accuracy                           0.81       588
   macro avg       0.41      0.50      0.45       588
weighted avg       0.67      0.81      0.73       588



We used our all-coral scleractinian dataset to test how well a basic generalized linear model (GLM; here, a logistic regression) predicts coral occurrence. The GLM substantially underestimated the occurrence of corals: corals occur in about 20% of the original training data but in <1% of the predictions, and on the test set the model recovered none of the 108 coral-present cases (recall = 0.00). In light of these poor results, we proceeded with the parameterization of a random forest classifier.
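
The comparison between observed and predicted occurrence can be reproduced with a few lines. The sketch below uses synthetic data from `make_classification` as a stand-in for the coral table (an assumption, so the numbers will not match those above). It also fits a second model with `class_weight='balanced'`, an option this notebook did not use, shown only as one possible way to counter class imbalance in a logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the coral data: roughly 20% positives, 9 features
X, y = make_classification(n_samples=2000, n_features=9, n_informative=3,
                           weights=[0.8, 0.2], class_sep=0.5, random_state=30)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=30)

def occurrence_rates(model):
    """Return (observed training prevalence, predicted prevalence on the test set)."""
    model.fit(X_tr, y_tr)
    return float(np.mean(y_tr)), float(np.mean(model.predict(X_te)))

plain = occurrence_rates(LogisticRegression(max_iter=1000))
balanced = occurrence_rates(LogisticRegression(max_iter=1000, class_weight='balanced'))
print('unweighted: observed=%.2f predicted=%.2f' % plain)
print('balanced:   observed=%.2f predicted=%.2f' % balanced)
```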

## Train random forest classifier

In [93]:
# Instantiate model with 1000 decision trees
rf = RandomForestClassifier(n_estimators = 1000, random_state = 30)

# Train the model on training data
rf.fit(X_train, y_train)

# Predicting to the test data
predictions = rf.predict(X_test)

# Check precision, F1 score
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.86      0.96      0.91       480
           1       0.62      0.31      0.41       108

    accuracy                           0.84       588
   macro avg       0.74      0.63      0.66       588
weighted avg       0.82      0.84      0.82       588



In [123]:
# Write results to file
# Select the test rows from the full table by index so each row keeps its
# lat/lon/ID, then reset the index so it lines up with the predictions array
test_df = df['scler_binary'].loc[y_test.index].reset_index(drop=True)

pred_df = pd.DataFrame({'prediction': predictions})

rf_test_df = test_df.join(pred_df) # join on the fresh 0..n-1 index
rf_test_df.head()
rf_test_df.to_csv('scler_rf_test_df.csv')

## Pickle the 'scler' classifier

In [109]:
with open('scler_classifier_model.pkl', 'wb') as fid:
    pickle.dump(rf, fid, 2)  

# Load the model from disk
with open('scler_classifier_model.pkl', 'rb') as fid:
    loaded_model = pickle.load(fid)
result = loaded_model.score(X_test, y_test)
print(result) # Mean accuracy on the test set (classifier .score), not r^2

0.8384353741496599


Our random forest binary classifier improved upon the poor performance of the GLM, reaching macro-averaged precision = 0.74 and F1 score = 0.66 (versus 0.41 and 0.45 for the GLM). We attempted to improve on these scores by fitting the model as a regressor (i.e., producing predictions within the 0-1 range rather than binary 0 or 1 predictions) and then thresholding its output.

## Train random forest regressor on binary data, interpret results based on a selected threshold 

In [57]:
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 30)

# Train the model on training data
rf.fit(X_train, y_train)

In [86]:
# Predicting to the test data
predictions = rf.predict(X_test)

# Interpret results based on a threshold (chosen by trial and error):
# predictions below 0.52 become 0, all others become 1
predictions = (predictions >= 0.52).astype(int)
        
# Check precision, F1 score
print(classification_report(y_test, predictions))

# Scores to beat: precision = 0.74, f1-score = 0.66
# ...not able to beat this ^

              precision    recall  f1-score   support

           0       0.85      0.96      0.90       480
           1       0.59      0.28      0.38       108

    accuracy                           0.83       588
   macro avg       0.72      0.62      0.64       588
weighted avg       0.81      0.83      0.81       588



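For each coral group, we chose by trial and error the threshold that jointly maximized macro-averaged precision and F1 score. For the all-coral scleractinian group, the thresholded regressor did not beat the classifier (macro-averaged precision = 0.72 and F1 score = 0.64, versus 0.74 and 0.66). The approach nonetheless resulted in superior model performance for those coral groups in which the rare occurrence of corals proved too challenging for the classifier to handle.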
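
The trial-and-error threshold search described above could be automated with a small sweep. The helper below, `sweep_thresholds`, is a hypothetical name and is not part of this notebook; the labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score

def sweep_thresholds(y_true, scores, thresholds):
    """Return (threshold, macro precision, macro F1) for each candidate threshold."""
    rows = []
    for t in thresholds:
        # Same rule as the cells in this notebook: below t is 0, otherwise 1
        y_pred = (scores >= t).astype(int)
        rows.append((float(t),
                     precision_score(y_true, y_pred, average='macro', zero_division=0),
                     f1_score(y_true, y_pred, average='macro', zero_division=0)))
    return rows

# Hypothetical labels and regressor outputs, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.4, 0.6, 0.5, 0.9])
results = sweep_thresholds(y_true, scores, thresholds=[0.3, 0.5, 0.7])
for t, prec, f1 in results:
    print(f'{t:.2f}  precision={prec:.2f}  f1={f1:.2f}')
```

With the real data, this would be called as `sweep_thresholds(y_test.to_numpy(), rf.predict(X_test), np.arange(0.1, 0.9, 0.01))`. Choosing the threshold on the test set makes the reported scores optimistic, so a separate validation split would be a safer place to tune it.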

## Train random forest classifiers for remaining coral groups

##### Branching

In [148]:
X_train, X_test, y_train, y_test = train_test_split(df['branching_binary'].drop(['Brnch_C','lat','lon','ID'], axis=1), 
                                                    df['branching_binary']['Brnch_C'], 
                                                    test_size = 0.3, random_state = 30)


print('Branching Training Features Shape:', X_train.shape)
print('Branching Training Labels Shape:', y_train.shape)
print('Branching Testing Features Shape:', X_test.shape)
print('Branching Testing Labels Shape:', y_test.shape)

rf = RandomForestRegressor(n_estimators = 1000, random_state = 30)
rf.fit(X_train, y_train)

Branching Training Features Shape: (2214, 9)
Branching Training Labels Shape: (2214,)
Branching Testing Features Shape: (949, 9)
Branching Testing Labels Shape: (949,)


RandomForestRegressor(n_estimators=1000, random_state=30)

In [149]:
predictions = rf.predict(X_test)
# Interpret results based on a threshold (chosen by trial and error, see log below)
predictions = (predictions >= 0.33).astype(int)

print(classification_report(y_test, predictions))

# Trial-and-error log: threshold, macro-avg precision, macro-avg F1
# 0.1 0.52 0.53
# 0.2 0.57 0.57
# 0.3 0.6 0.59
# 0.31 0.61 0.59
# 0.33 0.62 0.6
# 0.34 0.6 0.58

              precision    recall  f1-score   support

           0       0.98      0.99      0.98       927
           1       0.25      0.18      0.21        22

    accuracy                           0.97       949
   macro avg       0.62      0.58      0.60       949
weighted avg       0.96      0.97      0.97       949



In [154]:
# Select the test rows from the full table by index, then reset the index
# so it lines up with the predictions array
test_df = df['branching_binary'].loc[y_test.index].reset_index(drop=True)
pred_df = pd.DataFrame({'prediction': predictions})

rf_test_df = test_df.join(pred_df) # join on the fresh 0..n-1 index
rf_test_df.head()
rf_test_df.to_csv('branching_rf_test_df.csv')

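with open('branching_classifier_model.pkl', 'wb') as fid:
    pickle.dump(rf, fid, 2)  

# Load the model from disk
with open('branching_classifier_model.pkl', 'rb') as fid:
    loaded_model = pickle.load(fid)
result = loaded_model.score(X_test, y_test)
# For a regressor, .score is r^2 of the raw (unthresholded) predictions;
# a negative value means a worse fit than always predicting the mean of y_test
print(result)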

-0.07894129219378287
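
The negative value above is less odd once the meaning of `.score` is taken into account: scikit-learn classifiers report mean accuracy, while regressors report R^2, which drops below zero when the raw predictions fit worse than a constant prediction of the mean. The sketch below checks both identities on synthetic, imbalanced data (a stand-in, not the coral dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, r2_score

# Synthetic, imbalanced stand-in data (not the coral dataset)
X, y = make_classification(n_samples=500, n_features=9, weights=[0.9, 0.1],
                           random_state=30)
X_tr, X_te, y_tr, y_te = X[:350], X[350:], y[:350], y[350:]

clf = RandomForestClassifier(n_estimators=50, random_state=30).fit(X_tr, y_tr)
reg = RandomForestRegressor(n_estimators=50, random_state=30).fit(X_tr, y_tr)

clf_score = clf.score(X_te, y_te)  # mean accuracy for classifiers
reg_score = reg.score(X_te, y_te)  # R^2 for regressors; can be negative

print('classifier .score:', clf_score, '| accuracy:', accuracy_score(y_te, clf.predict(X_te)))
print('regressor  .score:', reg_score, '| r2:', r2_score(y_te, reg.predict(X_te)))
```

Because the regressor's `.score` ignores the threshold applied afterwards, the precision and F1 values from `classification_report` remain the relevant metrics for the thresholded predictions.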


##### Columnar

##### Encrusting

##### Free living

##### Massive

##### Plate-like