## Support Vector Regression of All Outputs of the CsPbCl3 QDs



# **Introduction**

This analysis focuses on modeling and predicting the properties of CsPbCl3 quantum dots using Support Vector Regression (SVR). The workflow includes data preprocessing, feature engineering, model training, evaluation, and result visualization.

---

# **1. Data Loading and Preprocessing**

Two dataset files are referenced, although the code in this notebook reads only the modified dataset:
- **Original Dataset**: `CsPbCl3_QDs.xlsx`
- **Modified Dataset**: `modified_data.xlsx`

### Key Preprocessing Steps:
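- **One-Hot Encoding**: Converts categorical (object-typed) columns into binary indicator columns.
- **Scaling**: Standardizes every feature to zero mean and unit variance, which SVR needs because the RBF kernel is distance-based.
- **PCA (Principal Component Analysis)**: Reduces dimensionality, keeping enough components to explain 95% of the variance (`n_components=0.95`).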

---

# **2. Model Training and Evaluation**

### Support Vector Regression (SVR):
SVR fits a function that ignores errors smaller than a tolerance `epsilon` and penalizes larger ones, while keeping the fitted function as flat as possible. The `rbf` kernel is used for its ability to model non-linear relationships.
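
To make the two ingredients concrete, here is a small sketch of the epsilon-insensitive loss and the RBF kernel. The helper names `epsilon_insensitive_loss` and `rbf_kernel_value` are hypothetical and written only for illustration; scikit-learn computes both internally.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample SVR loss: zero inside the epsilon tube, linear outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

def rbf_kernel_value(x, z, gamma=0.1):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Errors of 0.05, 0.5 and 1.0 with epsilon = 0.1 give losses of 0.0, 0.4 and 0.9.
losses = epsilon_insensitive_loss([1.0, 2.0, 3.0], [1.05, 2.5, 2.0], epsilon=0.1)
print(losses)

# Identical points have kernel value 1; the value decays as the points move apart.
print(rbf_kernel_value([0.0, 0.0], [0.0, 0.0]))
print(rbf_kernel_value([1.0, 0.0], [0.0, 0.0], gamma=1.0))
```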

### Evaluation Metrics:
- **R² (Coefficient of Determination)**: Measures the proportion of variance in the target explained by the model.
- **RMSE (Root Mean Squared Error)**: Square root of the mean squared error, in the same units as the target variable; large errors weigh more heavily than in MAE.
- **MAE (Mean Absolute Error)**: Average absolute error, in the same units as the target variable.
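
As a quick worked example of the three metrics, the sketch below evaluates them on four made-up values (not data from the quantum dot dataset) and checks R² against its definition.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up observed and predicted values; the residuals are 1, 0, -1, 0.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

r2 = r2_score(y_true, y_pred)                       # 0.9
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt(0.5), about 0.707
mae = mean_absolute_error(y_true, y_pred)           # 0.5

# R² by hand: 1 - (residual sum of squares) / (total sum of squares) = 1 - 2/20.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"R2={r2:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R2 (manual)={r2_manual:.3f}")
```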

### Hyperparameter Tuning:
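GridSearchCV with 5-fold cross-validation, scored by R², is used to optimize the SVR hyperparameters:
- `C`: Regularization parameter; larger values penalize errors outside the tolerance margin more heavily and allow a more complex fit.
- `gamma`: RBF kernel coefficient; larger values shrink the radius of influence of each training sample.
- `epsilon`: Half-width of the tube around the prediction within which errors incur no penalty.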
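
The cost of an exhaustive search grows with the product of the grid sizes. The sketch below counts the candidates for the grid used in the training cell of this notebook (4 values of `C`, 4 of `gamma`, 3 of `epsilon`, with `cv=5`); the variable names `n_candidates` and `n_fits_per_target` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the training cell; the "svr__" prefix addresses the SVR step of the pipeline.
param_grid = {
    "svr__C": [0.1, 1, 10, 100],
    "svr__gamma": ["scale", 0.01, 0.1, 1],
    "svr__epsilon": [0.01, 0.1, 0.5],
}

n_candidates = len(ParameterGrid(param_grid))  # 4 * 4 * 3 = 48 combinations
n_fits_per_target = n_candidates * 5           # 5-fold CV: 240 fits per target

print(n_candidates, n_fits_per_target)
```

With three targets this amounts to 720 cross-validation fits, plus one refit of the best model per target.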

---

# **3. Results Visualization**

### **Observed vs Predicted Values**
Scatter plots overlay observed and predicted test-set values against sample index for each target variable (`size_nm`, `S_abs_nm_Y1`, `PL`).

### **Residual Analysis**
Residual histograms (observed minus predicted, on the test set) visualize the error distribution; a distribution centered away from zero or strongly skewed points to systematic bias.
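
A histogram can be complemented with a few summary numbers. The following is a minimal sketch on made-up values; `summarize_residuals` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def summarize_residuals(y_true, y_pred):
    """Return the mean (bias), sample standard deviation, and largest absolute residual."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return {
        "mean": float(residuals.mean()),
        "std": float(residuals.std(ddof=1)),
        "max_abs": float(np.abs(residuals).max()),
    }

# Made-up values; the residuals are 1.0, -0.5, 1.0, -0.5.
summary = summarize_residuals([10.0, 12.0, 14.0, 16.0], [9.0, 12.5, 13.0, 16.5])
print(summary)
```

A mean far from zero relative to the standard deviation suggests the model systematically over- or under-predicts.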

### **Feature Importance**
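An RBF-kernel SVR does not provide feature importances directly (`coef_` is available only with a linear kernel), but model-agnostic tools such as SHAP or permutation importance can estimate the relative contribution of each feature. The importances plotted in this notebook are placeholders, not values computed from the fitted models.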
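
As one way to replace placeholder values with computed ones, the sketch below applies scikit-learn's `permutation_importance` to an SVR pipeline. It runs on synthetic data in which feature 0 dominates by construction, so the data, the coefficients, and the choice of `C=10` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: feature 0 matters most, feature 1 a little, feature 2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10)).fit(X_train, y_train)

# Shuffle each feature in turn on held-out data and record the drop in R².
result = permutation_importance(
    model, X_test, y_test, scoring="r2", n_repeats=10, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]

for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}")
```

Because the whole pipeline is passed in, the importances refer to the original input columns even when scaling or PCA sits between the inputs and the SVR.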

---

# **4. Summary**

This analysis applies SVR to predict three properties of CsPbCl3 quantum dots (`size_nm`, `S_abs_nm_Y1`, `PL`). How well the model captures the relationships between input features and target variables should be judged from the test-set R², RMSE, and MAE and from the residual plots, comparing train and test scores to check for overfitting.

---



## Support Vector Regression Analysis for CsPbCl3 Quantum Dots
### 1. Import Libraries

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
import seaborn as sns

### 2. Load and Preprocess Data

In [None]:
# File paths for datasets
file_path_original = "./CsPbCl3_QDs.xlsx"
file_path_modified = "./modified_data.xlsx"

# Load the dataset
CsPbCl3 = pd.read_excel(file_path_modified)

# Identify categorical columns and apply one-hot encoding
categorical_columns = CsPbCl3.select_dtypes(include=['object']).columns
one_hot_encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = one_hot_encoder.fit_transform(CsPbCl3[categorical_columns])
one_hot_encoded_df = pd.DataFrame(
    one_hot_encoded,
    columns=one_hot_encoder.get_feature_names_out(categorical_columns),
    index=CsPbCl3.index  # keep rows aligned for the index-based join below
)

# Replace categorical columns with one-hot encoded columns
CsPbCl3_encoded = CsPbCl3.drop(columns=categorical_columns).join(one_hot_encoded_df)

# Define the target variables
targets = ['size_nm', 'S_abs_nm_Y1', 'PL']


### 3. Train SVR Model and Evaluate Performance

In [None]:
# Dictionary to store results and predictions
results = {}
predictions = {}

# Train SVR model for each target
for target in targets:
    print(f"Training SVR for target: {target}")
    
    # Prepare features and target variable
    X = CsPbCl3_encoded.drop(columns=targets)
    y = CsPbCl3_encoded[target]
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Define a pipeline with scaling, PCA, and SVR
    pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVR(kernel='rbf'))
    
    # Define the parameter grid for GridSearchCV
    param_grid = {
        'svr__C': [0.1, 1, 10, 100],
        'svr__gamma': ['scale', 0.01, 0.1, 1],
        'svr__epsilon': [0.01, 0.1, 0.5]
    }
    
    # Perform grid search with cross-validation
    grid = GridSearchCV(pipe, param_grid, cv=5, scoring='r2', verbose=2)
    grid.fit(X_train, y_train)
    
    print(f"Best parameters for {target}: {grid.best_params_}")
    
    # Evaluate the best model
    best_model = grid.best_estimator_
    predictions_train = best_model.predict(X_train)
    predictions_test = best_model.predict(X_test)
    
    # Store results and predictions
    results[target] = {
        'Train R2': r2_score(y_train, predictions_train),
        'Test R2': r2_score(y_test, predictions_test),
        'Train RMSE': np.sqrt(mean_squared_error(y_train, predictions_train)),
        'Test RMSE': np.sqrt(mean_squared_error(y_test, predictions_test)),
        'Train MAE': mean_absolute_error(y_train, predictions_train),
        'Test MAE': mean_absolute_error(y_test, predictions_test)
    }
    predictions[target] = {'y_test': y_test, 'predictions_test': predictions_test}
    
    # Print results
    print(f"Performance for {target}:\n", results[target], "\n")


### 4. Visualize Results

In [None]:
# Visualize observed vs predicted values and residuals
fig, axs = plt.subplots(len(targets), 2, figsize=(12, 10))
titles = ['Size', '1 S abs', 'PL']

for i, target in enumerate(targets):
    y_test = predictions[target]['y_test']
    predictions_test = predictions[target]['predictions_test']
    
    # Plot (a): Observed vs Predicted Values
    sns.scatterplot(x=np.arange(len(y_test)), y=y_test, ax=axs[i, 0], label='Observed', color='red')
    sns.scatterplot(x=np.arange(len(predictions_test)), y=predictions_test, ax=axs[i, 0], label='Predicted', color='blue')
    axs[i, 0].set_title(f'{titles[i]}: Observed vs Predicted')
    axs[i, 0].set_xlabel('Sample Number')
    axs[i, 0].set_ylabel('Values')
    axs[i, 0].legend()
    
    # Plot (b): Residuals
    residuals = y_test - predictions_test
    sns.histplot(residuals, ax=axs[i, 1], kde=True, color='green')
    axs[i, 1].set_title(f'{titles[i]}: Residual Distribution')
    axs[i, 1].set_xlabel('Residuals')
    axs[i, 1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

### 5. Feature Importances (Placeholder for Explanation)

In [None]:
# An RBF-kernel SVR exposes no built-in feature importances; estimating them
# requires a model-agnostic method such as SHAP or permutation importance.
# The values below are hard-coded placeholders, NOT computed from the fitted models.

# Placeholder feature importances for 'size_nm'
feature_importances_size_nm = {
    'Pb amount': 0.14,
    'Cs/Pb ratio': 0.13,
    'Temperature': 0.12,
    'ODE volume': 0.11,
    'Cs amount': 0.10,
    'Pb/(OA + OL)': 0.09,
    'Cl/Pb ratio': 0.08
}

# Create a DataFrame for feature importances
feature_importances_df = pd.DataFrame(
    list(feature_importances_size_nm.items()), 
    columns=['Feature', 'Importance']
)

# Plot feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importances_df, palette='coolwarm')
plt.title('Feature Importances for Size (Placeholder)')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
