# INTRODUCTION

# DATA EXPLORATION    
## Setup
- **pandas** : Provides data structures and data analysis tools for handling and manipulating data.
- **seaborn** : Provides high-level statistical data visualization built on top of matplotlib.
- **numpy** : For numerical operations and array manipulations.
- **matplotlib.pyplot** : Offers functions for creating and customizing plots.
- **sklearn.base.BaseEstimator**: Provides a base class for all scikit-learn estimators, allowing for custom estimators.
- **sklearn.base.TransformerMixin**: Provides a mixin class for transformers, enabling custom transformations of data.
- **sklearn.pipeline.FeatureUnion**: Combines the outputs of several transformers into a single feature set.
- **sklearn.pipeline.Pipeline**: Chains multiple processing steps into a single machine learning pipeline.
- **sklearn.preprocessing.StandardScaler**: Standardizes features by removing the mean and scaling to unit variance.
- **sklearn.impute.SimpleImputer**: Handles missing values by imputing them with a specified strategy.
- **sklearn.preprocessing.OneHotEncoder**: Converts categorical features into a one-hot encoded format.
- **sklearn.model_selection.KFold**: Provides k-fold cross-validation for evaluating model performance.
- **statistics.mean**: Computes the arithmetic mean of a list of numbers.
- **joblib**: Provides utilities for saving and loading Python objects, particularly for machine learning models.

In [7]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from statistics import mean
import joblib

## Loading the data

In [3]:
raw_data = pd.read_csv(r'c:\Users\locha\AI\ASM3\global_bleaching_environmental.csv')
raw_data.head()


Unnamed: 0,Site_ID,Sample_ID,Data_Source,Latitude_Degrees,Longitude_Degrees,Ocean_Name,Reef_ID,Realm_Name,Ecoregion_Name,Country_Name,...,TSA_FrequencyMax,TSA_FrequencyMean,TSA_DHW,TSA_DHW_Standard_Deviation,TSA_DHWMax,TSA_DHWMean,Date,Site_Comments,Sample_Comments,Bleaching_Comments
0,2501,10324336,Donner,23.163,-82.526,Atlantic,nd,Tropical Atlantic,Cuba and Cayman Islands,Cuba,...,5,0,0.0,0.74,7.25,0.18,2005-09-15,nd,nd,nd
1,3467,10324754,Donner,-17.575,-149.7833,Pacific,nd,Eastern Indo-Pacific,Society Islands French Polynesia,French Polynesia,...,4,0,0.26,0.67,4.65,0.19,1991-03-15,The bleaching does not appear to have gained ...,The bleaching does not appear to have gained ...,nd
2,1794,10323866,Donner,18.369,-64.564,Atlantic,nd,Tropical Atlantic,Hispaniola Puerto Rico and Lesser Antilles,United Kingdom,...,7,0,0.0,1.04,11.66,0.26,2006-01-15,nd,nd,nd
3,8647,10328028,Donner,17.76,-64.568,Atlantic,nd,Tropical Atlantic,Hispaniola Puerto Rico and Lesser Antilles,United States,...,4,0,0.0,0.75,5.64,0.2,2006-04-15,nd,nd,nd
4,8648,10328029,Donner,17.769,-64.583,Atlantic,nd,Tropical Atlantic,Hispaniola Puerto Rico and Lesser Antilles,United States,...,5,0,0.0,0.92,6.89,0.25,2006-04-15,nd,nd,nd
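The preview shows the string 'nd' standing in for missing entries in several columns. As an alternative to filtering on that string later, one option is to have pandas treat it as missing at load time through the `na_values` parameter of `pd.read_csv`. The sketch below uses a small in-memory CSV with made-up rows rather than the real file, and the name `toy` is only illustrative. Note that the rest of this notebook compares against the literal string 'nd', so adopting this approach would mean switching those steps to `isna()` / `dropna()`.

```python
import io
import pandas as pd

# Made-up rows that mimic the 'nd' placeholder seen in the preview above.
csv_text = """Site_ID,Ocean_Name,Percent_Bleaching
2501,Atlantic,50.2
3467,Pacific,nd
1794,Atlantic,0
"""

# na_values tells read_csv to parse the listed strings as NaN.
toy = pd.read_csv(io.StringIO(csv_text), na_values=["nd"])

# The column is now numeric, and the placeholder is counted as missing.
print(toy["Percent_Bleaching"].dtype)
print(toy["Percent_Bleaching"].isna().sum())
```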


## First inspection

In [4]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41361 entries, 0 to 41360
Data columns (total 62 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Site_ID                                41361 non-null  int64  
 1   Sample_ID                              41361 non-null  int64  
 2   Data_Source                            41361 non-null  object 
 3   Latitude_Degrees                       41361 non-null  float64
 4   Longitude_Degrees                      41361 non-null  float64
 5   Ocean_Name                             41361 non-null  object 
 6   Reef_ID                                41361 non-null  object 
 7   Realm_Name                             41361 non-null  object 
 8   Ecoregion_Name                         41361 non-null  object 
 9   Country_Name                           41361 non-null  object 
 10  State_Island_Province_Name             41361 non-null  object 
 11  Ci

Because the dataset has 62 columns, we drop most of them and keep only the features we consider essential, which makes the Exploratory Data Analysis (EDA) more efficient.
Here are the features we deemed essential:
- **Latitude_Degrees**: To locate each sample geographically (north-south position).
- **Longitude_Degrees**: To locate each sample geographically (east-west position).
- **Ocean_Name**: To explore differences between oceans.
- **Realm_Name**: To examine variations over different realms.
- **Distance_to_Shore**: To study the relationship between proximity to shore and coral bleaching.
- **Exposure**: To assess how exposure affects bleaching.
- **Turbidity**: To analyze the effect of water clarity on coral health.
- **Cyclone_Frequency**: To evaluate the impact of cyclone frequency on coral bleaching.
- **Depth_m**: To study the influence of depth on coral bleaching.
- **Percent_Cover**: To analyze the extent of coral cover.
- **Bleaching_Level**: For understanding the severity of bleaching.
- **Percent_Bleaching**: To quantify the extent of bleaching.
- **Temperature_Mean**: To explore the impact of average temperature on bleaching.
- **Temperature_Maximum**: To study the effect of maximum temperature on bleaching.
- **Temperature_Kelvin**: For understanding temperature in absolute terms.
- **Windspeed**: To investigate the role of wind speed in bleaching.
- **SSTA_Maximum**: To study the impact of extreme sea surface temperature anomalies.
- **SSTA_DHW**: To explore the relationship between degree heating weeks and bleaching.
- **Date**: To study the change in bleaching over time.

In [5]:
# List the chosen features.
main_features = ['Latitude_Degrees',
    'Longitude_Degrees',
    'Ocean_Name',
    'Realm_Name',
    'Distance_to_Shore',
    'Exposure',
    'Turbidity',
    'Cyclone_Frequency',
    'Depth_m',
    'Percent_Cover',
    'Bleaching_Level',
    'Percent_Bleaching',
    'Temperature_Mean',
    'Temperature_Maximum',
    'Temperature_Kelvin',
    'Windspeed',
    'SSTA_Maximum',
    'SSTA_DHW',
    'Date' ]

# Keep only the features listed above; .copy() makes df_filtered independent of raw_data.
df_filtered = raw_data[main_features].copy()
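Selecting columns with a list raises a `KeyError` if any name in the list is not present in the DataFrame, so a single misspelled feature name stops the cell. A quick check before selecting can report exactly which names are missing. The following is a minimal sketch on a toy frame; `toy_raw`, `wanted`, and the deliberate typo are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-in for raw_data (the real frame has 62 columns).
toy_raw = pd.DataFrame({
    "Latitude_Degrees": [23.163, -17.575],
    "Longitude_Degrees": [-82.526, -149.7833],
    "Ocean_Name": ["Atlantic", "Pacific"],
    "Site_Comments": ["nd", "nd"],
})

# Hypothetical feature list containing one deliberate typo.
wanted = ["Latitude_Degrees", "Longtitude_Degrees", "Ocean_Name"]

# Names that were requested but do not exist in the frame.
missing = [col for col in wanted if col not in toy_raw.columns]
print("Missing columns:", missing)

# Select only the names that exist; .copy() yields an independent frame.
present = [col for col in wanted if col in toy_raw.columns]
toy_filtered = toy_raw[present].copy()
print(toy_filtered.columns.tolist())
```

In the notebook itself, the same check could be run with `main_features` and `raw_data` before building `df_filtered`.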

In [6]:
# Double-check the filtered frame.
df_filtered.head()

Unnamed: 0,Latitude_Degrees,Longitude_Degrees,Ocean_Name,Realm_Name,Distance_to_Shore,Exposure,Turbidity,Cyclone_Frequency,Depth_m,Percent_Cover,Bleaching_Level,Percent_Bleaching,Temperature_Mean,Temperature_Maximum,Temperature_Kelvin,Windspeed,SSTA_Maximum,SSTA_DHW,Date
0,23.163,-82.526,Atlantic,Tropical Atlantic,8519.23,Exposed,0.0287,49.9,10.0,nd,nd,50.2,300.67,304.69,302.05,8,2.24,0.0,2005-09-15
1,-17.575,-149.7833,Pacific,Eastern Indo-Pacific,1431.62,Exposed,0.0262,51.2,14.0,nd,nd,50.7,300.73,305.01,303.3,2,3.1,0.26,1991-03-15
2,18.369,-64.564,Atlantic,Tropical Atlantic,182.33,Exposed,0.0429,61.52,7.0,nd,nd,50.9,300.32,304.14,299.18,8,2.83,0.0,2006-01-15
3,17.76,-64.568,Atlantic,Tropical Atlantic,313.13,Exposed,0.0424,65.39,9.02,nd,nd,50.9,300.38,304.07,299.61,3,2.47,0.0,2006-04-15
4,17.769,-64.583,Atlantic,Tropical Atlantic,792.0,Exposed,0.0424,65.39,12.5,nd,nd,50.9,300.38,303.76,299.7,3,2.3,0.0,2006-04-15


## Handling missing values
The feature "Percent_Bleaching", which is our main feature, has many values recorded as the placeholder 'nd', which could distort our models. Therefore, we have elected to remove all rows containing 'nd' in the "Percent_Bleaching" column.

In [8]:
df_clean = df_filtered[df_filtered['Percent_Bleaching'] != 'nd']
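Two follow-up steps are often useful around this filter: counting how many 'nd' placeholders each column holds, and converting the filtered column to a numeric dtype, since a column that contained strings is not stored as numbers. The sketch below shows both on a toy frame with made-up values; `toy`, `nd_counts`, and `toy_clean` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Made-up values that mimic numeric columns polluted by the 'nd' placeholder.
toy = pd.DataFrame({
    "Percent_Bleaching": ["50.2", "nd", "0", "nd", "12.5"],
    "Depth_m": ["10.0", "14.0", "nd", "7.0", "9.02"],
})

# Count 'nd' placeholders per column (elementwise comparison, then column sums).
nd_counts = (toy == "nd").sum()
print(nd_counts)

# Same filter as above, with .copy() so the result can be modified safely.
toy_clean = toy[toy["Percent_Bleaching"] != "nd"].copy()

# Convert the remaining strings to numbers.
toy_clean["Percent_Bleaching"] = pd.to_numeric(toy_clean["Percent_Bleaching"])
print(toy_clean["Percent_Bleaching"].dtype)
```

Other columns such as `Depth_m` may still contain 'nd' after this filter, so they would need their own handling before any numeric analysis.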

## Exploratory Data Analysis (EDA)