The objective of this project is to train a model that yields a mathematical equation. The equation takes specific input features (such as temperature, water quality, and sunlight exposure) and predicts the percentage of coral bleaching. The goal is for it to serve as a practical tool that helps people identify the factors contributing to coral bleaching and take steps to mitigate or prevent it.

### DATA PREPROCESSING

In [135]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer  
from sklearn.preprocessing import OneHotEncoder      
from sklearn.model_selection import KFold   
from statistics import mean
import joblib


In [136]:
raw_data = pd.read_csv(r'.\dataset\global_bleaching_environmental_cleaned.csv')

### Calculate the percentage of missing values in each column


#### Reasons to Remove Columns with High Missing Value Percentage

#### Data Integrity

- **High Proportion of Missing Data:**
  - If more than 25% of a column's values are missing, a significant portion of the data is unavailable. This can undermine the reliability of any analysis or modeling performed using that column.
  
- **Incomplete Information:**
  - With such a high proportion of missing data, the column may not provide a complete or representative view of the feature, potentially leading to biased or incomplete insights.

#### Impact on Model Performance

- **Imputation Uncertainty:**
  - Imputing missing values in columns with such high proportions can introduce substantial uncertainty and may not accurately represent the underlying data distribution.

- **Model Complexity:**
  - Including columns with high missing value percentages can complicate the model and may lead to overfitting if imputation methods are not carefully chosen.

#### Statistical Validity

- **Bias in Imputation:**
  - Imputation methods (mean, median, mode) may not be effective or appropriate for columns with a high percentage of missing values, potentially introducing bias or reducing the validity of the statistical analysis.

#### Data Quality

- **Noise and Error:**
  - Columns with extensive missing data may be indicative of data quality issues or errors in data collection processes, leading to unreliable or noisy datasets.

#### Simplifying the Dataset

- **Focus on Relevant Features:**
  - Removing columns with excessive missing values helps in focusing on more relevant and reliable features, simplifying the dataset and potentially improving model performance.

---


In [137]:
# Calculate the number of missing values per column
missing_values_count = raw_data.isnull().sum()

# Calculate the percentage of missing values per column
total_rows = len(raw_data)
missing_percentage = (missing_values_count / total_rows) * 100

# Create a DataFrame to display the results
missing_data_df = pd.DataFrame({
    'Missing Values': missing_values_count,
    'Percentage': missing_percentage
})

print(missing_data_df)

                     Missing Values  Percentage
Latitude_Degrees                  0    0.000000
Longitude_Degrees                 0    0.000000
Ocean_Name                        0    0.000000
Realm_Name                        0    0.000000
Ecoregion_Name                    3    0.007284
Distance_to_Shore                 2    0.004856
Exposure                          0    0.000000
Turbidity                         6    0.014569
Cyclone_Frequency                 0    0.000000
Depth_m                        1797    4.363345
Percent_Cover                 12420   30.157343
Bleaching_Level                   0    0.000000
Percent_Bleaching              6777   16.455420
ClimSST                         111    0.269522
Temperature_Kelvin              146    0.354507
Temperature_Mean                130    0.315657
Temperature_Maximum             130    0.315657
Windspeed                       127    0.308372
SSTA                            146    0.354507
SSTA_Mean                       130    0

#### Remove the column with more than 25% missing values (Percent_Cover)

In [138]:
column_to_remove = 'Percent_Cover'
# Remove the column
raw_data = raw_data.drop(columns=[column_to_remove])

##### Reason for this:

1. **Bias Reduction:**
   - **High Missing Rate:** If a column has a high percentage of missing values, any attempt to fill in those missing values (e.g., through imputation) can introduce significant bias. The imputed values may not accurately represent the true data, leading to unreliable models.
   - **Reduced Data Quality:** Columns with a lot of missing data can degrade the quality of the dataset, as imputed values might not capture the variability and true relationships within the data.

2. **Simplification of the Model:**
   - **Avoiding Overfitting:** Including columns with many missing values might increase the complexity of the model, leading to overfitting. Removing such columns can simplify the model, making it more generalizable.
   - **Improving Interpretability:** Fewer, more relevant features make it easier to interpret and understand the model. Columns with high missing values often contribute little to the model's predictive power.

3. **Efficient Use of Resources:**
   - **Reduced Computational Load:** By removing columns with a high percentage of missing values, you reduce the dimensionality of your dataset, leading to faster training and testing times. This is particularly important when working with large datasets.
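
The threshold rule described above can also be written as a small reusable helper. The following is only a sketch: the function name `drop_sparse_columns`, the `threshold_pct` parameter, and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold_pct=25.0):
    """Return a copy without columns whose missing percentage exceeds threshold_pct.

    Hypothetical helper: also returns the list of dropped column names.
    """
    # isnull().mean() gives the fraction of missing values per column
    missing_pct = df.isnull().mean() * 100
    to_drop = missing_pct[missing_pct > threshold_pct].index.tolist()
    return df.drop(columns=to_drop), to_drop


# Toy frame with made-up values: Percent_Cover is 50% missing, Depth_m is 25% missing
toy = pd.DataFrame({
    "Depth_m": [5.0, np.nan, 7.0, 9.0],
    "Percent_Cover": [np.nan, 20.0, np.nan, 35.0],
    "Exposure": ["Exposed", "Sheltered", "Exposed", "Sheltered"],
})

filtered, dropped = drop_sparse_columns(toy)
print(dropped)
print(filtered.columns.tolist())
```

Because the comparison is strict (`>`), a column that is exactly 25% missing is kept in this sketch.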

### Handle the outliers 


#### Handling Outliers Using the Interquartile Range (IQR)

Handling outliers using the Interquartile Range (IQR) is a common and effective method. The IQR is the range between the first quartile (Q1) and the third quartile (Q3) and represents the middle 50% of the data. 

#### Steps to Handle Outliers Using IQR

1. **Calculate the IQR:**
   - Compute Q1 (the first quartile) and Q3 (the third quartile) for each numerical feature.
   
2. **Identify Outliers:**
   - Determine the lower and upper bounds using the IQR:
     - **Lower Bound:** $$ \text{Lower Bound} = Q1 - 1.5 \times \text{IQR} $$
     - **Upper Bound:** $$ \text{Upper Bound} = Q3 + 1.5 \times \text{IQR} $$

3. **Handle Outliers:**
   - Depending on the situation, you can choose from the following methods:
     - **Remove Outliers:** Drop the rows containing outliers. This is the approach taken in this notebook.
     - **Cap Outliers:** Replace outliers with the nearest value within the acceptable range (often called winsorizing).
     - **Transform Outliers:** Apply a transformation (e.g., log) to reduce the impact of outliers.

---


### Handling Missing Values



#### Imputation with the Mode

Imputation with the mode involves replacing missing values in a dataset with the most frequent value (the mode) from the column. This method is commonly used for categorical data, where replacing missing values with the most common category is often a sensible approach.

#### How Imputation with Mode Works:

1. **Identify the Mode:**
   - **Definition:** The mode is the value that appears most frequently in a column.
     - For **categorical data**, this is the most common category.
     - For **numerical data**, the mode is the number that occurs most frequently. However, this method is less commonly used for numerical data compared to mean or median imputation.

2. **Replace Missing Values:**
   - **Imputation Process:** Missing values in the column are replaced with the mode. This ensures that the dataset remains complete and the imputed values reflect the most common observed values in that column.
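
The two steps can be sketched directly in pandas on a tiny categorical column. The category values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
ecoregion = pd.Series(
    ["Fiji Islands", "Bahamas", np.nan, "Fiji Islands", np.nan],
    name="Ecoregion_Name",
)

# Step 1: identify the mode (mode() ignores missing values by default;
# if several values tie, it returns all of them, so take the first)
mode_value = ecoregion.mode().iloc[0]

# Step 2: replace the missing values with the mode
filled = ecoregion.fillna(mode_value)

print(mode_value)
print(filled.tolist())
```

`SimpleImputer(strategy='most_frequent')` from scikit-learn performs the same kind of replacement and is convenient when the imputer has to be reused inside a pipeline.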

---

#### Imputation Using the Mean

Mean imputation is a common approach for numerical data: missing values are replaced with the mean (average) of the observed values in a column. It is most suitable when values are missing at random and the data is approximately normally distributed. For skewed columns, the median is a more robust alternative, and it is the strategy the code cell below applies to the numerical columns.

#### How Imputation with Mean Works:

1. **Calculate the Mean:**
   - **Definition:** Compute the mean of the column, excluding the missing values.

2. **Replace Missing Values:**
   - **Imputation Process:** Substitute the missing values with the calculated mean. This helps maintain the overall statistical properties of the dataset.
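
To make the effect of skew concrete, the following sketch compares mean and median imputation on a small made-up series with one extreme value. The variable names and numbers are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Made-up wind speed readings: one missing value and one extreme reading
windspeed = pd.Series([2.0, 4.0, np.nan, 6.0, 48.0])

# Both statistics skip missing values by default
mean_value = windspeed.mean()      # (2 + 4 + 6 + 48) / 4
median_value = windspeed.median()  # midpoint of 4 and 6

mean_filled = windspeed.fillna(mean_value)
median_filled = windspeed.fillna(median_value)

print(mean_value, median_value)
```

Here the mean is 15.0 while the median is 5.0: the single extreme reading pulls the mean far above the typical values, which is why the median can be the safer fill value for skewed columns.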

---


In [139]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Continue with the raw_data DataFrame from the previous cells.
# Reloading the CSV here would bring back the Percent_Cover column dropped above.

# Categorical columns
categorical_columns = ['Ecoregion_Name']

# Impute missing values in categorical columns with the most frequent value (mode)
imputer_categorical = SimpleImputer(strategy='most_frequent')
raw_data[categorical_columns] = imputer_categorical.fit_transform(raw_data[categorical_columns])

# Numerical columns
numerical_columns = ['Depth_m', 'Percent_Bleaching', 'ClimSST', 'Temperature_Kelvin', 
                     'Temperature_Mean', 'Temperature_Maximum', 'Windspeed', 'SSTA', 
                     'SSTA_Mean', 'SSTA_Maximum', 'SSTA_Frequency', 'SSTA_DHW', 
                     'TSA', 'TSA_Maximum', 'TSA_Mean', 'TSA_Frequency', 'TSA_DHW']

# Impute missing values in numerical columns with the median
imputer_numerical = SimpleImputer(strategy='median')
raw_data[numerical_columns] = imputer_numerical.fit_transform(raw_data[numerical_columns])

# Function to calculate IQR and handle outliers
def handle_outliers_iqr(df, columns):
    for col in columns:
        # Check if the column is numeric
        if pd.api.types.is_numeric_dtype(df[col]):
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            
            # Remove outliers by filtering rows
            df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
        else:
            print(f"Skipping non-numeric column: {col}")
        
    return df

# Apply the function to handle outliers
raw_data = handle_outliers_iqr(raw_data, numerical_columns)

# Save the updated DataFrame to a new file
raw_data.to_csv(r'.\dataset\global_bleaching_environmental_cleaned_imputed_no_outliers.csv', index=False)

print("Data imputed and outliers removed successfully!")


Data imputed and outliers removed successfully!
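
One detail of `handle_outliers_iqr` is worth noting: because `df` is filtered inside the loop, the quartiles of each later column are computed on rows that survived the earlier columns, so the result can depend on column order. A possible alternative, sketched below under the assumption that all listed columns are numeric, computes every bound on the original frame first and filters once. The name `remove_outliers_iqr_single_pass` and the toy data are hypothetical.

```python
import pandas as pd


def remove_outliers_iqr_single_pass(df, columns, k=1.5):
    """Keep rows whose values lie within the IQR bounds of every listed column.

    All bounds are computed on the unfiltered frame, so column order does not matter.
    """
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare each column against its own bounds (bounds are aligned by column name)
    within = df[columns].ge(lower, axis=1) & df[columns].le(upper, axis=1)
    return df[within.all(axis=1)]


# Made-up data: column "a" has one extreme value, column "b" has none
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
})
result = remove_outliers_iqr_single_pass(toy, ["a", "b"])
print(result)
```

Whether the sequential or the single-pass behaviour is preferable depends on the analysis. The sketch only makes the difference explicit.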
