# Machine Learning
## This notebook builds a Random Forest Classifier model to investigate coral bleaching
### This is an academic project from Dr. Anna Haywood's machine learning workshop
### The project aims to build a random forest to predict coral bleaching alert areas

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.metrics import (
    accuracy_score,
    classification_report, 
    confusion_matrix,
    roc_curve,
    auc,
    precision_recall_curve
)

The data is from https://noaa.maps.arcgis.com/apps/instant/portfolio/index.html?appid=8dbe7b587c1d463bb2ad65e8b65da0d5. The imported data is from Fiji and has 5 bleaching alert area levels (0, 1, 2, 3, 4), with 0 indicating no bleaching and 4 indicating heavy bleaching.

In [3]:
file_path = '/Users/hangtran/Desktop/Project_4/Fiji (1).csv'
data = pd.read_csv(file_path)

# Preview data
print("Data preview: \n", data.head(), "\n")
required_columns = [
    'Date', 'Latitude', 'Longitude',
    'Sea_Surface_Temperature', 'HotSpots',
    'Degree_Heating_Weeks',  'Bleaching_Alert_Area'
]
# Check missing columns
for col in required_columns:
    if col not in data.columns:
        raise ValueError(f'Missing columns: {col}')

Data preview: 
          Date  Latitude  Longitude  Sea_Surface_Temperature  HotSpots  \
0  01/01/1985   -16.775     179.35                    27.80       0.0   
1  01/02/1985   -16.775     179.35                    27.99       0.0   
2  01/03/1985   -16.775     179.35                    28.06       0.0   
3  01/04/1985   -16.775     179.35                    26.94       0.0   
4  01/05/1985   -16.775     179.35                    27.03       0.0   

   Degree_Heating_Weeks  Bleaching_Alert_Area  
0                   0.0                     0  
1                   0.0                     0  
2                   0.0                     0  
3                   0.0                     0  
4                   0.0                     0   



We now train a Random Forest Classifier whose target is the bleaching alert area described above.

In [4]:
# Split features (X) and target (y)
X = data.drop(columns = ['Date', 'Bleaching_Alert_Area'])
y = data['Bleaching_Alert_Area']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train random forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
print(f'Accuracy score: {accuracy: 0.5f} \n')
print(f'Confusion matrix: \n {confusion}')
print("Classification report:")
print(classification_report(y_test, y_pred))

Accuracy score:  0.90844 

Confusion matrix: 
 [[1776   44    0    0    0]
 [ 134  647   21    3    6]
 [   2   41  140    0    0]
 [   0   10    1   57    1]
 [   0    6    0    0   49]]
Classification report:
              precision    recall  f1-score   support

           0       0.93      0.98      0.95      1820
           1       0.86      0.80      0.83       811
           2       0.86      0.77      0.81       183
           3       0.95      0.83      0.88        69
           4       0.88      0.89      0.88        55

    accuracy                           0.91      2938
   macro avg       0.90      0.85      0.87      2938
weighted avg       0.91      0.91      0.91      2938



## Result: 
The model performs well, with an accuracy of 91%. There are 5 classes (0, 1, 2, 3, 4), one per bleaching alert area level.

For class 0, precision (93%) and recall (98%) are both high, meaning the model captures nearly all true class 0 cases.

For class 1, recall is lower (80%): 134 of the 811 class 1 instances are misclassified as class 0.

For classes 2, 3, and 4, precision and recall are good overall, but their sample counts (support) are low.
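
The per-class figures discussed here can be recomputed directly from the confusion matrix printed above. The following is a minimal sketch, assuming only NumPy and the matrix values copied from that output; the variable names `row_share`, `recall`, and `precision` are illustrative and not part of the original notebook.

```python
import numpy as np

# Confusion matrix of the first model, copied from the output above
# (rows = true class, columns = predicted class)
confusion = np.array([
    [1776,  44,   0,  0,  0],
    [ 134, 647,  21,  3,  6],
    [   2,  41, 140,  0,  0],
    [   0,  10,   1, 57,  1],
    [   0,   6,   0,  0, 49],
])

# Row-normalising gives, for each true class, the share sent to each predicted class.
# The diagonal of that matrix is the per-class recall.
row_share = confusion / confusion.sum(axis=1, keepdims=True)
recall = np.diag(row_share)

# Column-normalising gives, for each predicted class, the share coming from each true class.
# The diagonal of that matrix is the per-class precision.
precision = np.diag(confusion / confusion.sum(axis=0, keepdims=True))

print("Recall per class:   ", np.round(recall, 2))
print("Precision per class:", np.round(precision, 2))
print(f"Share of class 1 predicted as class 0: {row_share[1, 0]:.1%}")
```

Reading the off-diagonal entries of `row_share` shows where each class's errors go, which the classification report alone does not reveal.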

We now address two prominent issues with the model:
1. Class imbalance: class 0 is overrepresented in the dataset
2. The hyperparameters are not optimized

## Fixing the class imbalance with SMOTE

In [6]:
# Address class imbalance, first check the frequency of each class
print(y.value_counts())
# Class 0 is much larger than the other classes
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42, sampling_strategy='auto')
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
print(y_train_resampled.value_counts())  # Now every class is balanced


Bleaching_Alert_Area
0    9081
1    3933
2     998
3     376
4     300
Name: count, dtype: int64
Bleaching_Alert_Area
0    7261
1    7261
3    7261
2    7261
4    7261
Name: count, dtype: int64


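Now that SMOTE has resampled every class in the training split, we can build the model again on this balanced data. Note that the test data is left untouched: resampling it would leak synthetic information into the evaluation, and the test set should keep the real class distribution.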

In [7]:
# Fit model, predict and evaluation after SMOTE
rf.fit(X_train_resampled, y_train_resampled)
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
print(f'Accuracy score after resample: {accuracy: 0.5f} \n')
print(f'Confusion matrix after resample: \n {confusion}')
print("Classification report after resample:")
print(classification_report(y_test, y_pred))

Accuracy score after resample:  0.89346 

Confusion matrix after resample: 
 [[1742   77    1    0    0]
 [ 116  612   60    9   14]
 [   2   25  156    0    0]
 [   0    4    1   63    1]
 [   0    3    0    0   52]]
Classification report after resample:
              precision    recall  f1-score   support

           0       0.94      0.96      0.95      1820
           1       0.85      0.75      0.80       811
           2       0.72      0.85      0.78       183
           3       0.88      0.91      0.89        69
           4       0.78      0.95      0.85        55

    accuracy                           0.89      2938
   macro avg       0.83      0.88      0.85      2938
weighted avg       0.89      0.89      0.89      2938



## Result: 
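Although accuracy is lower (89%), the model is now better at detecting the minority classes and is less biased toward class 0.

SMOTE introduces synthetic samples for the underrepresented classes, so the model sees as many training examples of each minority class as of class 0.

Recall improves for classes 2, 3, and 4 (0.77 to 0.85, 0.83 to 0.91, and 0.89 to 0.95), meaning the model is better at detecting the minority classes. The cost is lower precision for those classes and lower recall for class 1 (0.80 to 0.75).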
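
To make the idea of synthetic samples concrete, here is a minimal NumPy sketch of SMOTE-style interpolation. It is not the `imblearn` implementation used above. The function name `smote_like_oversample` and the toy feature values are made up for illustration. Each new point is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority samples
    and their k nearest minority-class neighbours (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances within the minority class
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample, one of its neighbours, and a position between them
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # uniform in [0, 1)
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Hypothetical minority-class rows: (sea surface temperature, degree heating weeks)
X_min = np.array([[29.5, 8.1], [29.8, 9.0], [30.1, 8.6], [29.9, 10.2]])
X_syn = smote_like_oversample(X_min, n_new=6, k=2, seed=42)
print(X_syn.shape)
```

Because every synthetic row lies between two real minority rows, oversampling adds no information from outside the observed feature range. This is one reason precision on the oversampled classes can drop even as recall rises.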

## Hyperparameter tuning with RandomizedSearchCV

In [8]:
# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
parameters = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'class_weight':['balanced', None]
}
rf_hyper = RandomForestClassifier(random_state=42)
random_search = RandomizedSearchCV(
    param_distributions=parameters,
    estimator=rf_hyper,
    cv=3,
    scoring='f1_weighted'
)
random_search.fit(X_train_resampled, y_train_resampled)
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
print(f'Accuracy score after resample & hyperparameters tuning: {accuracy: 0.5f} \n')
print(f'Confusion matrix after resample & hyperparameters tuning: \n {confusion}')
print("Classification report after resample & hyperparameters tuning:")
print(classification_report(y_test, y_pred))

Accuracy score after resample & hyperparameters tuning:  0.89381 

Confusion matrix after resample & hyperparameters tuning: 
 [[1748   71    1    0    0]
 [ 117  607   62   10   15]
 [   2   25  156    0    0]
 [   0    4    1   63    1]
 [   0    3    0    0   52]]
Classification report after resample & hyperparameters tuning:
              precision    recall  f1-score   support

           0       0.94      0.96      0.95      1820
           1       0.85      0.75      0.80       811
           2       0.71      0.85      0.77       183
           3       0.86      0.91      0.89        69
           4       0.76      0.95      0.85        55

    accuracy                           0.89      2938
   macro avg       0.83      0.88      0.85      2938
weighted avg       0.89      0.89      0.89      2938



## Result: 
Accuracy increases only marginally (from 89.346% to 89.381%, one more correct prediction out of 2938), so tuning leaves performance essentially unchanged.

The confusion matrix shows that the model performs reasonably well across all classes.

Class 0 has the most true positives (1748 of 1820), meaning it is well classified.

Classes 1 and 2 still have noticeable misclassifications, but the errors are more balanced than in the first model, likely due to the resampling technique.

Classification report: class 0 still performs best. Class 3 shows strong recall (0.91), while class 1 has the lowest recall (0.75).

Class 4 has a high recall (0.95), meaning the model is very good at detecting it, though its precision (0.76) could still be better.
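
One caveat when reading these results: `RandomizedSearchCV` evaluates only `n_iter` sampled combinations (10 by default), and the search above sets no `random_state`, so reruns may select different hyperparameters. The sketch below counts the combinations in the same parameter grid and runs a seeded search. It uses a small synthetic dataset from `make_classification` as a stand-in for the Fiji data, so the selected parameters are not meaningful for the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# Same search space as in the notebook cell above
parameters = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'class_weight': ['balanced', None],
}

# Total number of distinct combinations in the grid
n_combinations = len(ParameterGrid(parameters))
print(f'Combinations in the grid: {n_combinations}')

# Synthetic stand-in data (assumption: NOT the Fiji dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3,
    n_classes=3, random_state=42,
)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=parameters,
    n_iter=5,            # number of sampled combinations (kept small for speed)
    cv=3,
    scoring='f1_weighted',
    random_state=42,     # makes the sampled combinations reproducible
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With only a handful of combinations sampled from the grid, a near-identical score after tuning is not surprising. Raising `n_iter` widens the search at the cost of training time.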

## Conclusion:
The Random Forest model performs well, with an accuracy of 89.381% after resampling with SMOTE and hyperparameter tuning.

The macro average shows reasonably balanced performance across all classes, and the weighted average confirms that the model does well overall.

However, the weighted F1-score (0.89) is higher than the macro F1-score (0.85), meaning the model still performs better on the larger classes.
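
The difference between the macro and weighted averages can be checked by hand from the final confusion matrix. This is a minimal sketch using only NumPy and the values printed in the tuned model's output; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix after SMOTE and hyperparameter tuning, copied from the output above
# (rows = true class, columns = predicted class)
confusion = np.array([
    [1748,  71,   1,  0,  0],
    [ 117, 607,  62, 10, 15],
    [   2,  25, 156,  0,  0],
    [   0,   4,   1, 63,  1],
    [   0,   3,   0,  0, 52],
])

true_pos = np.diag(confusion)
support = confusion.sum(axis=1)     # number of true samples per class
predicted = confusion.sum(axis=0)   # number of predictions per class

# F1 = 2TP / (2TP + FP + FN), and (support + predicted) equals 2TP + FN + FP
f1 = 2 * true_pos / (support + predicted)

macro_f1 = f1.mean()                            # every class counts equally
weighted_f1 = np.average(f1, weights=support)   # classes weighted by their size

print("F1 per class:", np.round(f1, 2))
print(f"Macro F1:    {macro_f1:.2f}")
print(f"Weighted F1: {weighted_f1:.2f}")
```

The weighted average is pulled up by class 0, which accounts for 1820 of the 2938 test samples, while the macro average gives the smaller classes equal weight.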

## Next steps for future improvement:

1. Further hyperparameter tuning
2. Feature engineering
3. Advanced resampling techniques
4. Ensemble models such as XGBoost and LightGBM


