## NND Method for the CsPbCl3 Codes

## Nearest Neighbour Distance (NND) Overview

Nearest Neighbour Distance (NND) is a data-analysis technique that relates data points to one another by their proximity in feature space. In machine learning, it is commonly implemented as **K-Nearest Neighbors (KNN)** regression, which predicts a target value for a new point from the target values of its `k` closest training points, typically by averaging them.
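
The prediction rule behind KNN regression can be sketched from scratch in a few lines. This is a minimal illustration on made-up data, not the scikit-learn implementation used later in this notebook; the helper name `knn_predict` and the toy values are assumptions.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict a target for `query` as the mean target of its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Keep the k closest points and average their targets
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(y for _, y in nearest) / k

# Toy data (made up for illustration): one feature, one target
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]

prediction = knn_predict(train_X, train_y, query=(2.5,), k=2)
print(prediction)  # 25.0
```

The two points closest to the query `2.5` are `2.0` and `3.0`, so the prediction is the mean of their targets. No model is fitted in advance: all the work happens when a prediction is requested.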

### Key Concepts

1. **Instance-Based Learning**:
   - NND stores the training data and predicts directly from it, without fitting an explicit parametric model.
   - It is referred to as a "lazy learner" because computation happens during prediction.

2. **Similarity Measure**:
   - The method calculates the distance (e.g., Euclidean, Manhattan) between data points to identify the closest neighbors.

3. **Hyperparameters**:
   - **Number of Neighbors (`k`)**: Determines the number of closest neighbors used for predictions.
   - **Distance Metric**: Defines how distances are calculated (e.g., Euclidean, Minkowski).

4. **Advantages**:
   - Simple and interpretable approach.
   - Effective for datasets with localized patterns.

5. **Limitations**:
   - Computationally expensive at prediction time for large datasets, because each query must be compared against the stored training points.
   - Sensitive to the choice of `k` and feature scaling.
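
To illustrate why feature scaling matters for a distance-based method, here is a minimal sketch with made-up values. The feature names `large_scale` and `small_scale` are hypothetical and are not columns from the dataset. Each feature is standardized to zero mean and unit population standard deviation, which is the same transformation scikit-learn's `StandardScaler` applies.

```python
import math
import statistics

def standardize(column):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

def nearest_index(points, query_index):
    """Return the index of the point closest (Euclidean) to points[query_index]."""
    others = [i for i in range(len(points)) if i != query_index]
    return min(others, key=lambda i: math.dist(points[i], points[query_index]))

# Hypothetical features on very different numeric scales (values are made up)
large_scale = [150.0, 151.0, 160.0, 200.0]
small_scale = [0.10, 0.90, 0.12, 0.50]

raw = list(zip(large_scale, small_scale))
scaled = list(zip(standardize(large_scale), standardize(small_scale)))

print(nearest_index(raw, 0))     # 1: the large-scale feature dominates the distance
print(nearest_index(scaled, 0))  # 2: both features now contribute comparably
```

Before scaling, point 1 looks closest to point 0 because the two differ by only one unit in `large_scale`, even though they sit at opposite ends of `small_scale`. After standardization, point 2 becomes the nearest neighbor.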

### NND in This Project

In this analysis, the **K-Nearest Neighbors (KNN)** regression algorithm is used to predict the target variables (`size_nm`, `S_abs_nm_Y1`, and `PL`). The algorithm works as follows:
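- Categorical features are one-hot encoded; numerical features are kept as they are.
- Missing values are filled with training-set means, and features are standardized with `StandardScaler` so that features with large numeric ranges do not dominate the distance calculation.
- The number of neighbors (`k`) is fixed at `5` for this implementation; the code below does not tune it.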
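
The code in this notebook passes `n_neighbors=5` directly. As a rough sketch of how a value of `k` could instead be chosen from the data, the following pure-Python example compares several candidates by leave-one-out error. The synthetic data, the candidate list, and the helper names `knn_predict` and `loo_rmse` are assumptions for illustration; in practice a tool such as scikit-learn's `GridSearchCV` would typically handle this.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Mean target of the k training points closest (Euclidean) to `query`."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    return sum(y for _, y in ranked[:k]) / k

def loo_rmse(X, y, k):
    """Leave-one-out RMSE: predict each point from all the other points."""
    squared_errors = []
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        squared_errors.append((knn_predict(rest_X, rest_y, X[i], k) - y[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(X))

# Synthetic data (not from the CsPbCl3 dataset): y = 2x with alternating noise
X = [(float(i),) for i in range(12)]
y = [2.0 * i + (0.5 if i % 2 == 0 else -0.5) for i in range(12)]

# Score each candidate k and keep the one with the lowest leave-one-out error
scores = {k: loo_rmse(X, y, k) for k in (1, 2, 3, 5, 7)}
best_k = min(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: LOO RMSE = {score:.3f}")
print("best k:", best_k)
```

The selected value depends entirely on the data, so a `k` chosen on this toy example says nothing about the best value for the perovskite dataset.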

### Evaluation Metrics

The model performance is evaluated using the following metrics:
- **R² (Coefficient of Determination)**: Measures how well the model explains the variance in the target variable.
- **RMSE (Root Mean Squared Error)**: Quantifies the average magnitude of errors in predictions.
- **MAE (Mean Absolute Error)**: Captures the average absolute difference between observed and predicted values.
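
For reference, the three metrics can be written out by hand. The sketch below uses the standard definitions (R² = 1 − SS_res / SS_tot, RMSE as the square root of the mean squared error, MAE as the mean absolute error) on made-up numbers; the function name `regression_metrics` is an assumption, and the notebook itself relies on the equivalent functions from `sklearn.metrics`.

```python
import math

def regression_metrics(observed, predicted):
    """Compute R2, RMSE, and MAE from paired observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    # Residual sum of squares and total sum of squares
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(o - p) for o, p in zip(observed, predicted)) / n,
    }

# Made-up values for illustration
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]

metrics = regression_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these values the script prints `R2: 0.9000`, `RMSE: 0.7071`, and `MAE: 0.5000`. RMSE is at least as large as MAE because squaring gives larger errors more weight.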

### Visualization

The results are visualized using:
1. Scatter plots of observed and predicted values, plotted against the test-sample index.
2. Histograms of the residuals (observed minus predicted) to assess the distribution of prediction errors.



In [None]:
# Required Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsRegressor
import matplotlib.pyplot as plt
import seaborn as sns

# File Paths
file_path_original = "./CsPbCl3_QDs.xlsx"
file_path_modified = "./modified_data.xlsx"

# Step 1: Load and Preprocess Data
def load_and_preprocess_data(file_path):
    """
    Load and preprocess the dataset.
    
    Parameters:
        file_path (str): Path to the Excel file.
    
    Returns:
        pd.DataFrame: Preprocessed dataset.
    """
    data = pd.read_excel(file_path)
    
    # Identify categorical columns
    categorical_columns = data.select_dtypes(include=['object']).columns
    
    # Apply one-hot encoding to categorical columns
    one_hot_encoder = OneHotEncoder(sparse_output=False)
    one_hot_encoded = one_hot_encoder.fit_transform(data[categorical_columns])
    one_hot_encoded_df = pd.DataFrame(
        one_hot_encoded, 
        columns=one_hot_encoder.get_feature_names_out(categorical_columns)
    )
    
    # Replace categorical columns with one-hot encoded columns
    data_encoded = data.drop(categorical_columns, axis=1)
    data_encoded = pd.concat([data_encoded, one_hot_encoded_df], axis=1)
    return data_encoded

# Load datasets
data_original = load_and_preprocess_data(file_path_original)
data_modified = load_and_preprocess_data(file_path_modified)

# Step 2: Prepare Data for Machine Learning
def prepare_ml_data(data, target_column):
    """
    Prepare the dataset for machine learning.
    
    Parameters:
        data (pd.DataFrame): Dataset.
        target_column (str): Target variable.
    
    Returns:
        Tuple: Train-test splits (X_train, X_test, y_train, y_test) from a 70/30 split.
    """
    X = data.drop(target_column, axis=1)
    y = data[target_column]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    return X_train, X_test, y_train, y_test

# Step 3: Train Nearest Neighbour Distance Model
def train_nnd_model(X_train, y_train, X_test, y_test, n_neighbors=5):
    """
    Train a KNN regressor on the dataset.
    
    Parameters:
        X_train (pd.DataFrame): Training features.
        y_train (pd.Series): Training target.
        X_test (pd.DataFrame): Testing features.
        y_test (pd.Series): Testing target.
        n_neighbors (int): Number of nearest neighbors.
    
    Returns:
        dict: Model predictions and performance metrics.
    """
    # Impute missing values using training-set statistics only
    feature_means = X_train.mean()
    y_train_filled = y_train.fillna(y_train.mean())

    # Scale features (the scaler is fitted on the training data only)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train.fillna(feature_means))
    X_test_scaled = scaler.transform(X_test.fillna(feature_means))

    # Define and fit the KNN regressor
    knn = KNeighborsRegressor(n_neighbors=n_neighbors)
    knn.fit(X_train_scaled, y_train_filled)

    # Predictions
    predictions_train = knn.predict(X_train_scaled)
    predictions_test = knn.predict(X_test_scaled)

    # Performance metrics
    metrics = {
        "Train R2": r2_score(y_train_filled, predictions_train),
        "Train RMSE": np.sqrt(mean_squared_error(y_train_filled, predictions_train)),
        "Train MAE": mean_absolute_error(y_train_filled, predictions_train),
        "Test R2": r2_score(y_test, predictions_test),
        "Test RMSE": np.sqrt(mean_squared_error(y_test, predictions_test)),
        "Test MAE": mean_absolute_error(y_test, predictions_test)
    }
    return {
        "predictions_train": predictions_train,
        "predictions_test": predictions_test,
        "metrics": metrics
    }

# Step 4: Evaluate Targets
targets = ['size_nm', 'S_abs_nm_Y1', 'PL']
results = {}

for target in targets:
    print(f"Evaluating target: {target}")
    X_train, X_test, y_train, y_test = prepare_ml_data(data_modified, target)
    results[target] = train_nnd_model(X_train, y_train, X_test, y_test, n_neighbors=5)
    # Keep the observed test values so Step 5 can compare them with the predictions
    results[target]["y_test"] = y_test.to_numpy()

    # Print metrics
    print(f"Metrics for {target}:")
    for metric, value in results[target]["metrics"].items():
        print(f"  {metric}: {value:.4f}")
    print("\n")

# Step 5: Visualization
fig, axs = plt.subplots(3, 2, figsize=(15, 15))

for i, target in enumerate(targets):
    # Observed test values (not the predictions), so the residuals are meaningful
    y_test = results[target]['y_test']
    predictions_test = results[target]['predictions_test']
    
    # Plot 1: Observed vs Predicted
    sns.scatterplot(x=np.arange(len(y_test)), y=y_test, ax=axs[i, 0], label='Observed', color='red')
    sns.scatterplot(x=np.arange(len(predictions_test)), y=predictions_test, ax=axs[i, 0], label='Predicted', color='blue')
    axs[i, 0].set_title(f'{target} - Observed vs Predicted')
    
    # Plot 2: Residuals
    residuals = y_test - predictions_test
    sns.histplot(residuals, ax=axs[i, 1], kde=True, color='green')
    axs[i, 1].set_title(f'{target} - Residuals Distribution')

plt.tight_layout()
plt.show()
