## Training and test performance of the GBM model

### **Gradient Boosting Machine (GBM) Overview**

Gradient Boosting Machine (GBM) is a powerful ensemble learning technique commonly used for regression and classification tasks. It builds predictive models by sequentially combining the outputs of weaker learners (typically decision trees) to improve overall accuracy.

In this project, GBM is used to predict the target variables (`size_nm`, `S_abs_nm_Y1`, and `PL`) by learning from the provided dataset. The key concepts behind GBM include:

1. **Boosting**: 
   - Boosting is an iterative technique where each subsequent model corrects the errors of its predecessor.
   - GBM builds models in a stage-wise fashion, optimizing the loss function at each step.

2. **Gradient Descent**:
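   - GBM minimizes the loss function (e.g., Mean Squared Error for regression) by gradient descent in function space: each new tree is fitted to the negative gradient of the loss with respect to the current predictions, which for squared error is simply the residual.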

3. **Hyperparameters**:
   - **Number of Trees (`n_estimators`)**: Determines the number of weak learners to be combined.
   - **Learning Rate (`learning_rate`)**: Scales the contribution of each tree to the overall model. Smaller values usually need more trees but reduce the risk of overfitting.
   - **Maximum Features (`max_features`)**: Limits the number of features considered at each split, which lowers variance and training cost.

4. **Advantages**:
   - Handles complex, non-linear relationships effectively.
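   - Tolerates noisy data reasonably well. Missing values are not handled automatically in this workflow; they are imputed with training-set means before fitting.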

5. **Limitations**:
   - Can be prone to overfitting if hyperparameters are not tuned carefully.
   - Training can be computationally expensive for large datasets.
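
The boosting and gradient-descent ideas above can be illustrated with a hand-rolled loop. This is a minimal sketch for squared-error loss only, not the implementation inside scikit-learn's `GradientBoostingRegressor`; the helper names (`fit_boosted_trees`, `predict_boosted_trees`), the synthetic sine data, and `max_depth=2` are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    """Illustrative gradient boosting for squared-error loss (hypothetical helper)."""
    # Stage 0: start from a constant prediction, the mean of the target.
    baseline = float(np.mean(y))
    prediction = np.full(len(y), baseline)
    trees = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is the current residual.
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # Shrink each tree's correction by the learning rate before adding it.
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return baseline, trees


def predict_boosted_trees(X, baseline, trees, learning_rate=0.1):
    """Sum the baseline and the shrunken corrections of every tree."""
    prediction = np.full(X.shape[0], baseline)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction


# Synthetic data (an assumption for this sketch, unrelated to the project dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

baseline, trees = fit_boosted_trees(X_demo, y_demo)
pred = predict_boosted_trees(X_demo, baseline, trees)

mse_baseline = np.mean((y_demo - baseline) ** 2)
mse_boosted = np.mean((y_demo - pred) ** 2)
print(f"Training MSE, constant baseline: {mse_baseline:.4f}")
print(f"Training MSE, after boosting:    {mse_boosted:.4f}")
```

Lowering `learning_rate` in this sketch makes each tree correct a smaller share of the residual, which is why it is usually tuned together with `n_estimators`.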

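In this analysis, GBM is tuned with **Grid Search** over `max_features`, `n_estimators`, and `learning_rate`, using repeated 5-fold cross-validation (3 repeats) scored by R². The best model is then evaluated on the training and test sets with the following metrics: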
- **R² (Coefficient of Determination)**: Measures how well the model explains the variance in the target variable.
- **RMSE (Root Mean Squared Error)**: Quantifies the average magnitude of errors in predictions.
- **MAE (Mean Absolute Error)**: Captures the average absolute difference between observed and predicted values.
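
To make the three metrics concrete, they can be computed directly from their definitions. This is a small sketch with NumPy; the function name `regression_metrics` and the four example values are assumptions made for illustration and are not taken from the dataset.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute R2, RMSE, and MAE from their definitions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    ss_res = np.sum(errors ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "R2": float(1.0 - ss_res / ss_tot),
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "MAE": float(np.mean(np.abs(errors))),
    }


# Made-up values, for illustration only.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]

demo_metrics = regression_metrics(observed, predicted)
for name, value in demo_metrics.items():
    print(f"{name}: {value:.4f}")
# R2: 0.9750
# RMSE: 0.3536
# MAE: 0.2500
```

RMSE is larger than MAE here because squaring gives more weight to the larger errors; the two are equal only when all absolute errors have the same size.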


In [None]:
# Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import OneHotEncoder

# File Paths
file_path_original = "./CsPbCl3_QDs.xlsx"  # Original dataset
file_path_modified = "./modified_data.xlsx"  # Preprocessed dataset

# Step 1: Load and Preprocess Data
def load_and_preprocess_data(file_path):
    """
    Load and preprocess the dataset.
    
    Parameters:
        file_path (str): Path to the Excel file.
    
    Returns:
        pd.DataFrame: Preprocessed dataset.
    """
    data = pd.read_excel(file_path)
    
    # Identify categorical columns
    categorical_columns = data.select_dtypes(include=['object']).columns
    
    # Apply one-hot encoding to categorical columns
    one_hot_encoder = OneHotEncoder(sparse_output=False)
    one_hot_encoded = one_hot_encoder.fit_transform(data[categorical_columns])
    one_hot_encoded_df = pd.DataFrame(
        one_hot_encoded, 
        columns=one_hot_encoder.get_feature_names_out(categorical_columns)
    )
    
    # Replace categorical columns with one-hot encoded columns
    data_encoded = data.drop(categorical_columns, axis=1)
    data_encoded = pd.concat([data_encoded, one_hot_encoded_df], axis=1)
    return data_encoded

# Load datasets
data_original = load_and_preprocess_data(file_path_original)
data_modified = load_and_preprocess_data(file_path_modified)

# Step 2: Prepare Data for Machine Learning
def prepare_ml_data(data, target_column):
    """
    Prepare the dataset for machine learning.
    
    Parameters:
        data (pd.DataFrame): Dataset.
        target_column (str): Target variable.
    
    Returns:
        Tuple: Train-test splits (X_train, X_test, y_train, y_test).
    """
    X = data.drop(target_column, axis=1)
    y = data[target_column]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    return X_train, X_test, y_train, y_test

# Step 3: Train Gradient Boosting Model
def train_gradient_boosting(X_train, y_train, X_test, y_test):
    """
    Train a Gradient Boosting Regressor on the dataset.
    
    Parameters:
        X_train (pd.DataFrame): Training features.
        y_train (pd.Series): Training target.
        X_test (pd.DataFrame): Testing features.
        y_test (pd.Series): Testing target.
    
    Returns:
        dict: Model predictions and performance metrics.
    """
    # Fill missing values with training-set means so no test-set statistics leak into the model
    X_train_filled = X_train.fillna(X_train.mean())
    X_test_filled = X_test.fillna(X_train.mean())
    y_train_filled = y_train.fillna(y_train.mean())
    
    # Define parameter grid
    param_grid = {
        'max_features': ['sqrt', 'log2'],
        'n_estimators': [100, 150, 200],
        'learning_rate': [0.05, 0.1, 0.15]
    }
    gbm = GradientBoostingRegressor(random_state=42)
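    # Seed the CV splitter so the repeated folds, and hence the selected model, are reproducible
    cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
    grid_search = GridSearchCV(gbm, param_grid, cv=cv, scoring='r2', verbose=1)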
    grid_search.fit(X_train_filled, y_train_filled)
    
    # Predictions
    predictions_train = grid_search.predict(X_train_filled)
    predictions_test = grid_search.predict(X_test_filled)
    
    # Performance metrics
    metrics = {
        "Train R2": r2_score(y_train_filled, predictions_train),
        "Train RMSE": np.sqrt(mean_squared_error(y_train_filled, predictions_train)),
        "Train MAE": mean_absolute_error(y_train_filled, predictions_train),
        "Test R2": r2_score(y_test, predictions_test),
        "Test RMSE": np.sqrt(mean_squared_error(y_test, predictions_test)),
        "Test MAE": mean_absolute_error(y_test, predictions_test)
    }
    return {
        "predictions_train": predictions_train,
        "predictions_test": predictions_test,
        "metrics": metrics
    }

# Step 4: Evaluate Targets
targets = ['size_nm', 'S_abs_nm_Y1', 'PL']
results = {}

for target in targets:
    print(f"Evaluating target: {target}")
    X_train, X_test, y_train, y_test = prepare_ml_data(data_modified, target)
    results[target] = train_gradient_boosting(X_train, y_train, X_test, y_test)

    # Print metrics
    print(f"Metrics for {target}:")
    for metric, value in results[target]["metrics"].items():
        print(f"  {metric}: {value:.4f}")
    print("\n")

# Step 5: Visualization
fig, axs = plt.subplots(3, 2, figsize=(15, 15))

for i, target in enumerate(targets):
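    # Recover the observed test targets; the split is deterministic (random_state=42),
    # so this matches the rows used to produce predictions_test
    y_test = prepare_ml_data(data_modified, target)[3].to_numpy()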
    predictions_test = results[target]['predictions_test']
    
    # Plot 1: Observed vs Predicted
    sns.scatterplot(x=np.arange(len(y_test)), y=y_test, ax=axs[i, 0], label='Observed', color='red')
    sns.scatterplot(x=np.arange(len(predictions_test)), y=predictions_test, ax=axs[i, 0], label='Predicted', color='blue')
    axs[i, 0].set_title(f'{target} - Observed vs Predicted')
    
    # Plot 2: Residuals
    residuals = y_test - predictions_test
    sns.histplot(residuals, ax=axs[i, 1], kde=True, color='green')
    axs[i, 1].set_title(f'{target} - Residuals Distribution')

plt.tight_layout()
plt.show()
