## Decision Tree Model for Performance Evaluation



## Overview

This script evaluates the performance of a Decision Tree Regressor on three target variables (`size_nm`, `S_abs_nm_Y1`, and `PL`) using a dataset of quantum dot features. The following steps were performed:

1. **Data Preprocessing**:
   - Handled missing values by imputing the median.
   - Scaled features using Min-Max Scaling.
   - Applied one-hot encoding to categorical variables for machine learning compatibility.

2. **Model Training and Hyperparameter Tuning**:
   - Used a **Decision Tree Regressor** for predictions.
   - Employed **RandomizedSearchCV** to optimize hyperparameters, including:
     - `max_depth`: Maximum depth of the tree.
     - `min_samples_split`: Minimum samples required to split a node.
     - `min_samples_leaf`: Minimum samples required to be at a leaf node.

3. **Evaluation Metrics**:
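   - **R²**: Coefficient of determination, the fraction of variance in the target that the model explains (1 is a perfect fit).
   - **RMSE**: Root Mean Squared Error, in the units of the target; it penalizes large errors more heavily than MAE.
   - **MAE**: Mean Absolute Error, the average magnitude of the prediction errors, in the units of the target.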

4. **Visualizations**:
   - Scatter plots comparing observed and predicted values.
   - Residuals (observed minus predicted), shown as point color on the observed-vs-predicted plots, for error evaluation.
   - Visualization of the trained Decision Tree for interpretability.

## Data Processing Steps

- **One-Hot Encoding**: Categorical (`object`-typed) columns were converted into numerical indicator columns. The script applies this step first, before imputation and scaling.
- **Handling Missing Data**: Missing feature values were filled with the median of each column.
- **Scaling**: Min-Max Scaling was applied to rescale each feature to the range [0, 1].
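
The three preprocessing steps above can also be bundled into a single scikit-learn transformer. The following is a minimal sketch, not the script's own implementation: the toy table and its column names (`temperature_C`, `ligand`) are hypothetical stand-ins for the quantum dot data, and `ColumnTransformer` is swapped in for the manual encode/impute/scale sequence.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-in for the quantum dot table; column names are hypothetical.
df = pd.DataFrame({
    "temperature_C": [150.0, np.nan, 180.0, 200.0],
    "ligand": ["oleic acid", "oleylamine", "oleic acid", "oleylamine"],
})
numeric_cols = ["temperature_C"]
categorical_cols = ["ligand"]

preprocess = ColumnTransformer(
    [
        # Numeric columns: median imputation, then Min-Max scaling to [0, 1].
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", MinMaxScaler()),
        ]), numeric_cols),
        # Categorical columns: one indicator column per category.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)
print(X)
```

Because the transformer learns its medians and ranges in `fit`, it can be fit on the training rows only and then applied to the test rows with `transform`, which keeps test-set statistics out of the preprocessing.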

## Model Evaluation

For each target variable (`size_nm`, `S_abs_nm_Y1`, `PL`), the following metrics were calculated:

- **Train Data Performance**:
  - R²
  - RMSE
  - MAE

- **Test Data Performance**:
  - R²
  - RMSE
  - MAE

The results were stored in dictionaries and displayed for easy interpretation.
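
As a worked illustration of what the three metrics measure, the sketch below computes them by hand with NumPy on four made-up values (the numbers are illustrative only and are not taken from the dataset). The formulas match what `r2_score`, `mean_squared_error`, and `mean_absolute_error` compute for a single target.

```python
import numpy as np

# Made-up observed and predicted values, for illustration only.
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([3.0, 4.0, 5.0, 10.0])

residuals = y_true - y_pred

# MAE: average magnitude of the errors.
mae = np.mean(np.abs(residuals))

# RMSE: square root of the mean squared error.
rmse = np.sqrt(np.mean(residuals ** 2))

# R²: one minus the ratio of residual to total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

For these four points the MAE is 1.0, the RMSE is √1.5 ≈ 1.225, and R² is 0.7. The RMSE exceeds the MAE because the single error of 2 is weighted more heavily once squared.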

## Code Summary

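The following Python packages were used:

- `pandas` for data manipulation.
- `numpy` for numerical operations (for example, taking the square root of the MSE to obtain the RMSE).
- `scikit-learn` for machine learning and evaluation.
- `matplotlib` and `seaborn` for visualizations.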

## Results and Insights

- The Decision Tree model provides an interpretable structure for understanding feature contributions.
- Residual plots help identify patterns in prediction errors, revealing potential areas for model improvement.
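- Hyperparameter tuning with randomized search selects, for each target variable, the best of the sampled configurations according to 5-fold cross-validation. It improves on default settings but does not guarantee the optimal configuration.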
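
To put the tuning step in perspective, the sketch below counts the size of the search space defined in the script and draws the same number of candidates that `RandomizedSearchCV` evaluates with `n_iter=10`. It uses scikit-learn's `ParameterGrid` and `ParameterSampler` helpers directly; the script itself does not call them explicitly.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the script.
param_dist = {
    "max_depth": [None, 10, 20, 30, 40, 50],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Total number of distinct combinations: 6 * 3 * 3.
n_total = len(ParameterGrid(param_dist))

# Ten combinations drawn at random, as with n_iter=10.
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=42))

print(f"{len(sampled)} of {n_total} combinations are evaluated")
for params in sampled:
    print(params)
```

With 10 of 54 combinations tried, the search covers under a fifth of the grid. Raising `n_iter`, or switching to `GridSearchCV` for a grid this small, would be a reasonable way to check whether a better configuration was missed.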

## Decision Tree Visualization

The tree structure reveals the hierarchy of feature splits, aiding in understanding the decision-making process of the model.
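
For deep trees the plotted figure can become hard to read, so a text dump of the split rules is sometimes a useful complement. The sketch below uses scikit-learn's `export_text` on a small synthetic dataset; the feature names `feature_a` and `feature_b` and the data are hypothetical and only illustrate the output format.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: the target depends only on feature_a (hypothetical names).
feature_a = np.linspace(0.0, 1.0, 20)
feature_b = np.tile([0.0, 1.0], 10)
X = np.column_stack([feature_a, feature_b])
y = np.where(feature_a > 0.5, 10.0, 5.0)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Print the learned split rules as indented text.
rules = export_text(tree, feature_names=["feature_a", "feature_b"])
print(rules)
```

Since a single split on `feature_a` already separates the two target values, the printed rules contain only that feature, which mirrors how the plotted tree exposes the features the model actually relies on.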


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer

# Load the data
file_path1 = "/Users/mehmetsiddik/Desktop/Musa/modified_data.xlsx"
file_path2 = "/Users/mehmetsiddik/Desktop/Musa/CsPbCl3_QDs.xlsx"
CsPbCl3 = pd.read_excel(file_path1)

# Identify categorical columns
categorical_columns = CsPbCl3.select_dtypes(include=['object']).columns

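# Apply one-hot encoding to categorical columns
# (scikit-learn >= 1.2: `sparse_output` replaces `sparse`, and
# `get_feature_names_out` replaces the removed `get_feature_names`)
one_hot_encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = one_hot_encoder.fit_transform(CsPbCl3[categorical_columns])
one_hot_encoded_df = pd.DataFrame(
    one_hot_encoded,
    columns=one_hot_encoder.get_feature_names_out(categorical_columns),
    index=CsPbCl3.index,
)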

# Replace categorical columns with one-hot encoded columns
CsPbCl3_encoded = CsPbCl3.drop(categorical_columns, axis=1)
CsPbCl3_encoded = pd.concat([CsPbCl3_encoded, one_hot_encoded_df], axis=1)

# Target variables
targets = ['size_nm', 'S_abs_nm_Y1', 'PL']

# Initialize dictionaries to store results and predictions
results = {}
predictions = {}

# Model training and evaluation for each target
for target in targets:
    print(f"Evaluating target: {target}")
    
    # Prepare features and target
    X = CsPbCl3_encoded.drop(target, axis=1)
    y = CsPbCl3_encoded[target]

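    # Split the data first so the imputer and scaler are fit on training rows only
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Fill missing values with the training-set median
    imputer = SimpleImputer(missing_values=np.nan, strategy='median')
    X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X.columns, index=X_train.index)
    X_test = pd.DataFrame(imputer.transform(X_test), columns=X.columns, index=X_test.index)

    # Scale features to [0, 1] using the training-set range
    scaler = MinMaxScaler()
    X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns, index=X_train.index)
    X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)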

    # Decision Tree Regressor with hyperparameter tuning
    param_dist = {
        'max_depth': [None, 10, 20, 30, 40, 50],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }
    dtr = DecisionTreeRegressor(random_state=42)
    random_search = RandomizedSearchCV(dtr, param_dist, cv=5, n_iter=10, random_state=42, verbose=1)
    random_search.fit(X_train, y_train)

    # Evaluate the best model
    best_model = random_search.best_estimator_
    predictions_train = best_model.predict(X_train)
    predictions_test = best_model.predict(X_test)

    # Store predictions
    predictions[target] = {
        'y_test': y_test,
        'predictions_test': predictions_test
    }

    # Performance metrics
    results[target] = {
        'Train R2': r2_score(y_train, predictions_train),
        'Test R2': r2_score(y_test, predictions_test),
        'Train RMSE': np.sqrt(mean_squared_error(y_train, predictions_train)),
        'Test RMSE': np.sqrt(mean_squared_error(y_test, predictions_test)),
        'Train MAE': mean_absolute_error(y_train, predictions_train),
        'Test MAE': mean_absolute_error(y_test, predictions_test)
    }

    # Print performance metrics, reusing the values stored in results
    for split in ('Train', 'Test'):
        print(f"Performance for {split.lower()} data:")
        print("R2:", results[target][f'{split} R2'])
        print("RMSE:", results[target][f'{split} RMSE'])
        print("MAE:", results[target][f'{split} MAE'])
        if split == 'Train':
            print()
    print("\n")

# Display results
print(results)

# Plotting results
fig, axs = plt.subplots(3, 2, figsize=(12, 12))

titles = ['Size (size_nm)', '1 S abs (S_abs_nm_Y1)', 'PL']

for i, target in enumerate(targets):
    y_test = predictions[target]['y_test']
    predictions_test = predictions[target]['predictions_test']

    # Plot observed vs sample numbers
    sns.scatterplot(
        x=np.arange(1, len(y_test) + 1),
        y=y_test.values,
        ax=axs[i, 0],
        label='Observed',
        color='red'
    )
    sns.scatterplot(
        x=np.arange(1, len(y_test) + 1),
        y=predictions_test,
        ax=axs[i, 0],
        label='Predicted',
        color='#4363d8'
    )
    axs[i, 0].set_title(f'{titles[i]} - Observed and Predicted by Sample')
    axs[i, 0].set_xlabel('Sample Number')
    axs[i, 0].set_ylabel('Values (nm)')
    axs[i, 0].legend()

    # Plot observed vs predicted values
    residuals = y_test.values - predictions_test
    sns.scatterplot(
        x=y_test.values,
        y=predictions_test,
        hue=residuals,
        ax=axs[i, 1],
        palette='coolwarm'
    )
    axs[i, 1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
    axs[i, 1].set_title(f'{titles[i]} - Observed vs Predicted')
    axs[i, 1].set_xlabel('Observed Values')
    axs[i, 1].set_ylabel('Predicted Values')

plt.tight_layout()
plt.show()

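# Plot decision tree for the first target
target = 'size_nm'
X = CsPbCl3_encoded.drop(target, axis=1)
y = CsPbCl3_encoded[target]

# Impute missing feature values with the median, as in the training loop
X = pd.DataFrame(SimpleImputer(strategy='median').fit_transform(X), columns=X.columns)

# Refit for visualization. Note: after the loop, best_model carries the
# hyperparameters tuned for the last target in `targets`, not for size_nm.
best_model.fit(X, y)

plt.figure(figsize=(15, 10))
# Pass feature names as a plain list, which plot_tree accepts across scikit-learn versions
plot_tree(best_model, feature_names=list(X.columns), filled=True, rounded=True)
plt.title(f'Decision Tree for {target}')
plt.show()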
