# **Data Analysis and Machine Learning Workflow for CsPbCl3 Quantum Dots**

This project analyzes and models data related to CsPbCl3 quantum dots using statistical and machine learning techniques. The workflow includes data visualization, correlation analysis, and predictive modeling using Random Forest regression. Below are the key steps and explanations.

---

## **1. Data Loading and Preprocessing**

The dataset is loaded from two Excel files:
1. `CsPbCl3_QDs.xlsx`: Original dataset.
2. `modified_data.xlsx`: Preprocessed dataset.

### Key Steps:
- **Categorical Encoding**: Categorical columns are encoded using one-hot encoding.
- **Feature Selection**: Relevant features and outputs are selected for analysis.
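- **Handling Missing Values**: Missing feature values are filled with the training-set column means at modeling time, so the test split is imputed with statistics learned from the training split only.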
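
The imputation step can be sketched on a toy frame. This is a minimal illustration under stated assumptions, not the project's data: the numeric values below are made up, and the column names are borrowed from the dataset only for readability. The point is that the means are computed on the training split and then reused for the test split.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; NOT taken from CsPbCl3_QDs.xlsx or modified_data.xlsx.
X_train = pd.DataFrame({
    "Cs_mmol": [0.1, np.nan, 0.3],
    "Temperature": [150.0, 170.0, np.nan],
})
X_test = pd.DataFrame({
    "Cs_mmol": [np.nan, 0.4],
    "Temperature": [160.0, np.nan],
})

# Column means are learned from the training split only.
train_means = X_train.mean()

# The same training means fill the gaps in both splits.
X_train_filled = X_train.fillna(train_means)
X_test_filled = X_test.fillna(train_means)

print(X_test_filled)
```

Reusing `train_means` for the test split keeps test-set information out of preprocessing, which is the usual reason for imputing this way.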

---

## **2. Data Visualization**

### **Histograms**
Histograms display the distribution of key output variables (`size_nm`, `S_abs_nm_Y1`, `PL`).

- **Purpose**: Understand the spread and skewness of the data.
- **Colors**: Different colors represent each variable for better readability.
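
As a rough numeric companion to the histograms, spread and skewness can also be computed directly. The sketch below uses made-up values rather than the dataset; `Series.std()` and `Series.skew()` are standard pandas methods.

```python
import pandas as pd

# Hypothetical values standing in for one output column (not real measurements).
values = pd.Series([1.0, 2.0, 2.0, 3.0, 10.0])

spread = values.std()     # sample standard deviation
skewness = values.skew()  # sample skewness; > 0 indicates a longer right tail

print(f"std = {spread:.3f}, skew = {skewness:.3f}")
```

A skewness near zero suggests a roughly symmetric distribution, while a clearly positive or negative value points to a long tail on the right or left.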

### **Boxplots**
Boxplots provide insights into the variability and presence of outliers in the data.

- **Purpose**: Detect outliers and compare data distributions for key variables.
- **Customization**: Bold labels and color-coded boxes enhance visualization.
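
With matplotlib's default `whis=1.5`, a boxplot draws as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The sketch below applies the same rule numerically; `iqr_outliers` is a hypothetical helper and the sample values are made up.

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Return the values lying outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing entries
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical sample, not taken from the dataset.
sample = [9.0, 10.0, 10.5, 11.0, 11.5, 12.0, 30.0]
outliers = iqr_outliers(sample)
print(outliers)
```

Counting the flagged points per variable gives a quick check on what the boxplots show visually.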

### **Pearson Correlation Heatmap**
A heatmap is used to visualize the correlation between input features (`Cl_mmol`, `Cs_mmol`, `Oleylamine_OLA_ml`, etc.) and outputs (`size_nm`, `S_abs_nm_Y1`, `PL`).

- **Purpose**: Identify relationships between features and outputs.
- **Customization**: Includes custom labels and color gradients for better interpretability.
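
Each cell of the heatmap is a Pearson coefficient: the covariance of two columns divided by the product of their standard deviations. The sketch below computes it by hand and compares it with `DataFrame.corr()`, which uses Pearson by default. `pearson_r` is a hypothetical helper, and the values are illustrative, not measurements from the dataset.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centered x
    yc = y - y.mean()  # centered y
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up values; column names reused from the dataset for readability only.
toy = pd.DataFrame({
    "Temperature": [140.0, 160.0, 180.0, 200.0],
    "size_nm": [6.1, 7.0, 8.2, 8.9],
})

r_manual = pearson_r(toy["Temperature"], toy["size_nm"])
r_pandas = toy.corr().loc["Temperature", "size_nm"]
print(r_manual, r_pandas)
```

Pearson correlation only captures linear association, so a weak coefficient does not rule out a nonlinear relationship that a Random Forest could still exploit.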

---

## **3. Machine Learning: Random Forest Regression**

Random Forest regression is applied to predict the output variables (`size_nm`, `S_abs_nm_Y1`, `PL`).

### Key Steps:
- **Data Splitting**: The dataset is split into training (70%) and testing (30%) sets.
- **Hyperparameter Tuning**: Grid search with repeated 5-fold cross-validation (3 repeats) selects `max_features`, the number of features considered at each split.
- **Evaluation Metrics**:
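  - **R² (Coefficient of Determination)**: Proportion of the variance in the target variable that the model explains (1 is a perfect fit).
  - **RMSE (Root Mean Squared Error)**: Square root of the mean squared error, in the units of the target; it penalizes large errors more heavily than MAE.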
  - **MAE (Mean Absolute Error)**: Captures the average magnitude of prediction errors.
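
The three metrics above can be written out directly from their definitions. This is a minimal NumPy sketch with made-up numbers; `regression_metrics` is a hypothetical helper, and the notebook itself relies on the scikit-learn functions `r2_score`, `mean_squared_error`, and `mean_absolute_error`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R2, RMSE and MAE from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                 # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": float(np.sqrt(np.mean(residuals ** 2))),
        "MAE": float(np.mean(np.abs(residuals))),
    }

# Illustrative values only, not model output.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Comparing each metric between the training and test splits is a simple way to spot overfitting: a much higher training R² than test R² suggests the forest has memorized the training data.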

---

## **4. Results Visualization**

### **Observed vs Predicted Values**
Scatter plots overlay the observed and predicted test-set values for each target variable (`size_nm`, `S_abs_nm_Y1`, `PL`), plotted against the sample index.

- **Purpose**: Assess the alignment of predictions with actual values.
- **Customization**: Separate plots for each variable with labeled axes and legends.

### **Residual Analysis**
Histograms of the residuals (observed minus predicted values) show the distribution of prediction errors.

- **Purpose**: Identify any patterns or biases in prediction errors.
- **Customization**: Includes kernel density estimation (KDE) for better visualization.
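
Two quick numeric checks can complement the residual histograms: the mean residual (systematic bias) and the slope of the residuals against the predictions (a trend in the errors). The sketch below uses illustrative values and assumes the same sign convention as the notebook, residual = observed minus predicted.

```python
import numpy as np

# Illustrative values only, not model output.
observed = np.array([5.0, 6.0, 7.0, 8.0])
predicted = np.array([5.4, 6.0, 6.8, 7.4])

residuals = observed - predicted

# Mean residual: > 0 means the model under-predicts on average, < 0 over-predicts.
bias = residuals.mean()

# Slope of a least-squares line through (predicted, residual);
# a value far from 0 suggests the errors depend on the predicted magnitude.
slope, intercept = np.polyfit(predicted, residuals, deg=1)

print(f"bias = {bias:.3f}, slope = {slope:.3f}")
```

A histogram centered near zero with no trend in the residuals is what an unbiased model would be expected to produce.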

---

## **5. Summary**

This analysis examines the relationships between input features and output variables in CsPbCl3 quantum dots. The Random Forest model's ability to predict each key property is quantified by the train and test R², RMSE, and MAE that the notebook reports. The visualizations help interpret both the model's performance and the data's characteristics.

---



## 1. Data Loading and Preprocessing

In [None]:
# Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# File Paths
file_path_original = "./CsPbCl3_QDs.xlsx"
file_path_modified = "./modified_data.xlsx"

# Load and preprocess the dataset
def load_and_preprocess_data(file_path):
    """
    Load the dataset and preprocess it for analysis.
    """
    data = pd.read_excel(file_path)
    
    # Identify categorical (string-typed) columns
    categorical_columns = data.select_dtypes(include=['object']).columns

    # Nothing to encode (e.g. an already-preprocessed file): skip the encoder
    if len(categorical_columns) == 0:
        return data

    # Apply one-hot encoding (the sparse_output argument requires scikit-learn >= 1.2)
    one_hot_encoder = OneHotEncoder(sparse_output=False)
    one_hot_encoded = one_hot_encoder.fit_transform(data[categorical_columns])
    one_hot_encoded_df = pd.DataFrame(
        one_hot_encoded,
        columns=one_hot_encoder.get_feature_names_out(categorical_columns),
        index=data.index,  # keep rows aligned for the concat below
    )

    # Replace categorical columns with one-hot encoded columns
    data_encoded = data.drop(categorical_columns, axis=1)
    data_encoded = pd.concat([data_encoded, one_hot_encoded_df], axis=1)

    return data_encoded

# Load datasets
data_original = load_and_preprocess_data(file_path_original)
data_modified = load_and_preprocess_data(file_path_modified)


## 2. Data Visualization
### Histograms

In [None]:
# List of columns for histograms
columns = ['size_nm', 'S_abs_nm_Y1', 'PL']
colors = ['red', 'blue', 'green']

# Plot histograms
fig, axs = plt.subplots(1, 3, figsize=(13, 5))
for ax, column, color in zip(axs, columns, colors):
    ax.hist(data_modified[column], bins=20, color=color, alpha=0.7)
    ax.set_title(f'Distribution of {column}', fontsize=12)
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()


### Boxplots

In [None]:
# Boxplot visualization
fig, axs = plt.subplots(1, 3, figsize=(15, 6))
custom_labels = ['Size', '1 S abs', 'PL']

for ax, column, color, label in zip(axs, columns, colors, custom_labels):
    # Drop missing values first; NaNs would leave the box statistics undefined
    ax.boxplot(data_modified[column].dropna(), patch_artist=True, boxprops=dict(facecolor=color))
    ax.set_title(f'Boxplot of {column}', fontsize=12, fontweight='bold')
    ax.set_xlabel(label, fontsize=11, fontweight='bold')
    ax.set_ylabel('Value', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()


### Pearson Correlation Heatmap

In [None]:
# Define feature matrix and outputs
FeatureMatrix = ['Cl_mmol', 'Cs_mmol', 'Oleylamine_OLA_ml', 'Oleicacid_OA_ml', 'Temperature']
Output = ['size_nm', 'S_abs_nm_Y1', 'PL']

# Select relevant columns
df_corr = data_modified[FeatureMatrix + Output]

# Calculate Pearson correlation
cor = df_corr.corr()

# Define custom labels
custom_labels = ['Cl amount', 'Cs amount', 'OLA volume', 'OA volume', 'Temperature', 'Size', '1 S abs', 'PL']

# Create heatmap
plt.figure(figsize=(12, 10))
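# Mask the upper triangle (including the diagonal) so each pair appears once
mask = np.triu(np.ones_like(cor, dtype=bool))
sns.heatmap(cor, annot=True, cmap='coolwarm', mask=mask, xticklabels=custom_labels, yticklabels=custom_labels)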
plt.title('Pearson Correlation Heatmap', fontsize=14)
plt.show()

## 3. Machine Learning: Random Forest Regression

In [None]:
# Machine Learning: Random Forest
def train_random_forest(data, target, output_columns=('size_nm', 'S_abs_nm_Y1', 'PL')):
    """
    Train a Random Forest model for a given target.

    Returns the train/test metrics, the test-set predictions and the test-set targets.
    """
    # Drop rows with a missing target: NaNs in y would break the metrics below
    data = data.dropna(subset=[target])

    # Exclude every output column from the features so other targets cannot leak into X
    X = data.drop(columns=list(output_columns))
    y = data[target]

    # Train-test split (70% train, 30% test)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Fill missing feature values with training-set means (reused for the test split)
    train_means = X_train.mean()
    X_train_filled = X_train.fillna(train_means)
    X_test_filled = X_test.fillna(train_means)

    # Random Forest with Grid Search over max_features
    param_grid = {'max_features': [3, 5, 7, 9, 11, 13, 15]}
    rf_model = RandomForestRegressor(n_estimators=500, random_state=42)
    cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)  # seeded for reproducible folds
    grid_search = GridSearchCV(rf_model, param_grid, cv=cv, verbose=1)
    grid_search.fit(X_train_filled, y_train)

    # Evaluate performance with the best estimator found by the grid search
    predictions_train = grid_search.predict(X_train_filled)
    predictions_test = grid_search.predict(X_test_filled)

    metrics = {
        "Train R2": r2_score(y_train, predictions_train),
        "Test R2": r2_score(y_test, predictions_test),
        "Train RMSE": np.sqrt(mean_squared_error(y_train, predictions_train)),
        "Test RMSE": np.sqrt(mean_squared_error(y_test, predictions_test)),
        "Train MAE": mean_absolute_error(y_train, predictions_train),
        "Test MAE": mean_absolute_error(y_test, predictions_test)
    }

    return metrics, predictions_test, y_test

# Evaluate targets
targets = ['size_nm', 'S_abs_nm_Y1', 'PL']
results = {}

for target in targets:
    print(f"Training for target: {target}")
    metrics, predictions, actuals = train_random_forest(data_modified, target)
    results[target] = {"metrics": metrics, "predictions": predictions, "actuals": actuals}
    print(f"Metrics for {target}: {metrics}")


## 4. Visualization of Results

In [None]:
# Visualization of results
fig, axs = plt.subplots(3, 2, figsize=(15, 15))
titles = ['Size', '1 S abs', 'PL']

for i, target in enumerate(targets):
    predictions = results[target]['predictions']
    actuals = results[target]['actuals']
    
    # Plot (a): Observed vs Predicted, both plotted against the test-sample index
    sns.scatterplot(x=np.arange(len(actuals)), y=actuals, ax=axs[i, 0], label='Observed', color='red')
    sns.scatterplot(x=np.arange(len(predictions)), y=predictions, ax=axs[i, 0], label='Predicted', color='blue')
    axs[i, 0].set_title(f'{titles[i]} - Observed vs Predicted')
    axs[i, 0].set_xlabel('Test sample index')
    axs[i, 0].set_ylabel(titles[i])

    # Plot (b): Residuals (observed minus predicted) with a KDE overlay
    residuals = actuals - predictions
    sns.histplot(residuals, ax=axs[i, 1], kde=True, color='green')
    axs[i, 1].set_title(f'{titles[i]} - Residuals Distribution')
    axs[i, 1].set_xlabel('Residual (observed - predicted)')

plt.tight_layout()
plt.show()
