# ESG and Financial Performance Analysis

This notebook analyzes the relationship between ESG (Environmental, Social, and Governance) scores and financial performance metrics using the "ESG and Financial Performance" dataset. We explore, preprocess, and visualize the data, then apply the supervised and unsupervised machine learning algorithms implemented in the `Supervised_Learning` and `Unsupervised_Learning` directories.

## Contents:
1.  **Data Loading and Initial Exploration:** Load the dataset and perform basic checks (shape, info, missing values, summary statistics).
2.  **Data Preprocessing and Cleaning:** Handle missing values, encode categorical features, and scale numerical features using the provided utility function.
3.  **Exploratory Data Visualization:** Visualize data distributions and correlations.
4.  **Data Preparation for Machine Learning:** Split data into features (X) and target (y) for regression and classification tasks, and perform train/test splits.
5.  **Supervised Learning:**
    * Regression Task (Predicting ProfitMargin)
        * Linear Regression
        * Decision Tree Regressor
        * K-Nearest Neighbors (KNN) Regressor
        * Random Forest Regressor
        * Neural Network (Regression)
        * Gradient Boosting Regressor
        * Regression Model Comparison
    * Classification Task (Predicting High/Low ESG_Overall)
        * Perceptron
        * Logistic Regression
        * K-Nearest Neighbors (KNN) Classifier
        * Decision Tree Classifier
        * Random Forest Classifier
        * Neural Network (Classification)
        * AdaBoost Classifier
        * Classification Model Comparison
6.  **Unsupervised Learning:**
    * Principal Component Analysis (PCA)
    * K-Means Clustering
    * DBSCAN Clustering
    * Clustering Model Comparison
    * Singular Value Decomposition (SVD) for Compression (Demonstration)
7.  **Conclusion:** Summarize findings.

In [None]:
# Import necessary standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    mean_squared_error, r2_score, 
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, 
    silhouette_score
)
from sklearn.decomposition import PCA as SklearnPCA # For comparison/visualization
from sklearn.cluster import DBSCAN as SklearnDBSCAN # For comparison/visualization
from PIL import Image # For SVD section example
import requests
from io import BytesIO

# Import custom utilities and algorithms
from utils.data_preprocessing import preprocess_esg_data, create_feature_target_split
from utils.data_visualization import (
    plot_esg_distributions,
    plot_correlation_matrix,
    plot_financial_metrics,
    plot_esg_vs_financial,
    plot_pca_results,
    plot_clusters,
    plot_regression_results,
    plot_classification_results,
    plot_model_comparison # Added for comparing models
)

# Supervised Learning Algorithms
from Supervised_Learning.perceptron import Perceptron
from Supervised_Learning.linear_regression import LinearRegression
from Supervised_Learning.logistic_regression import LogisticRegression
from Supervised_Learning.neural_network import NeuralNetwork
from Supervised_Learning.knn import KNNClassifier, KNNRegressor
from Supervised_Learning.decision_tree import DecisionTreeClassifier, DecisionTreeRegressor
from Supervised_Learning.random_forest import RandomForestClassifier, RandomForestRegressor
from Supervised_Learning.ensemble_methods import AdaBoostClassifier, GradientBoostingRegressor

# Unsupervised Learning Algorithms
from Unsupervised_Learning.kmeans import KMeans
from Unsupervised_Learning.dbscan import DBSCAN
from Unsupervised_Learning.pca import PCA
from Unsupervised_Learning.svd_compression import SVDCompression

# Set visualization style
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['figure.dpi'] = 100

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

print("Libraries and modules imported.")

## 1. Data Loading and Initial Exploration

In [None]:
# Define the relative path to the data file
# Assumes the notebook is running in the root directory 'ML_Course_Project'
file_path = "./data/company_esg_financial_dataset.csv"

df_original = None  # Stays None if loading fails

try:
    # Check if the file exists at the specified path
    if os.path.exists(file_path):
        # Load the CSV file into a pandas DataFrame
        df_original = pd.read_csv(file_path)
        print(f"Successfully loaded data from: {file_path}")
    else:
        print(f"Error: File not found at the specified path: {file_path}")
        print("Please ensure the 'data' directory exists and the file 'company_esg_financial_dataset.csv' is inside it.")

except FileNotFoundError:
    print(f"Error: File not found at path: {file_path}")
    print("Please double-check the path and file name.")
except pd.errors.EmptyDataError:
    print(f"Error: The file at {file_path} is empty.")
except Exception as e:
    print(f"An error occurred while loading the file: {e}")

# Proceed with analysis if the DataFrame was loaded successfully
if df_original is not None:
    print("\nDataset loaded successfully.")
    print(f"Shape of the DataFrame: {df_original.shape}")
    print("\nFirst 5 records:")
    display(df_original.head())
else:
    print("\nFailed to load dataset. Please check the file path and integrity.")

In [None]:
# Get basic information about the dataset
if df_original is not None:
    print("\nDataset Information:")
    df_original.info()

In [None]:
# Check for missing values
if df_original is not None:
    missing_values = df_original.isnull().sum()
    missing_percentage = (missing_values / len(df_original)) * 100
    missing_df = pd.DataFrame({
        'Missing Values': missing_values,
        'Percentage': missing_percentage
    })
    print("\nMissing Values Analysis:")
    display(missing_df[missing_df['Missing Values'] > 0].sort_values('Missing Values', ascending=False))

*Observation:* The `GrowthRate` column has 1000 missing values (9.1%). All other columns are complete.

In [None]:
# Summary statistics of numerical columns
if df_original is not None:
    print("\nSummary Statistics:")
    display(df_original.describe())

In [None]:
# Check unique values in categorical columns
if df_original is not None:
    categorical_cols = df_original.select_dtypes(include=['object']).columns
    print("\nUnique values in categorical columns:")
    for col in categorical_cols:
        print(f"--- {col} ({df_original[col].nunique()} unique) ---")
        if df_original[col].nunique() < 15:
            print(df_original[col].value_counts())
        else:
            print(f"(Too many unique values to display: {df_original[col].nunique()})\n")

## 2. Data Preprocessing and Cleaning

We will use the `preprocess_esg_data` function from `utils.data_preprocessing`. This function handles:
1.  **Missing Value Imputation:** Uses median for numerical features (`GrowthRate`) and mode for categorical features (though none are missing here).
2.  **Categorical Encoding:** Converts `Industry` and `Region` into numerical representations using Label Encoding.
3.  **Feature Scaling:** Standardizes numerical features to have zero mean and unit variance using `StandardScaler`.
4.  **Outlier Handling (optional):** The utility function includes an optional IQR-based outlier-handling step. We enable it here for robustness, although its effect may be small for some of the algorithms.
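
The four steps above can be sketched on a toy frame as follows. This is a minimal illustration, not the code of `preprocess_esg_data`: the function name `sketch_preprocess`, the `iqr_factor` parameter, the order of the steps, and the choice to clip outliers rather than drop rows are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def sketch_preprocess(df: pd.DataFrame, iqr_factor: float = 1.5) -> pd.DataFrame:
    """Hypothetical stand-in for preprocess_esg_data; not the project's code."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    cat_cols = out.select_dtypes(exclude="number").columns

    # 1. Impute: median for numeric columns, mode for categorical columns.
    for col in num_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in cat_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    # 2. Clip numeric outliers to the IQR fences (assumption: rows are kept, not dropped).
    q1, q3 = out[num_cols].quantile(0.25), out[num_cols].quantile(0.75)
    iqr = q3 - q1
    out[num_cols] = out[num_cols].clip(
        lower=q1 - iqr_factor * iqr, upper=q3 + iqr_factor * iqr, axis=1
    )

    # 3. Standardize numeric columns to zero mean and unit variance.
    std = out[num_cols].std(ddof=0).replace(0, 1.0)
    out[num_cols] = (out[num_cols] - out[num_cols].mean()) / std

    # 4. Label-encode categoricals as integer codes (sorted category order).
    for col in cat_cols:
        out[col] = out[col].astype("category").cat.codes

    return out


toy = pd.DataFrame({
    "Industry": ["Tech", "Energy", "Tech", "Retail"],
    "GrowthRate": [0.10, np.nan, 0.30, 0.20],
    "ProfitMargin": [5.0, 7.0, 9.0, 11.0],
})
processed = sketch_preprocess(toy)
print(processed)
```

Label encoding assigns arbitrary integer codes, so distance-based models such as KNN will treat the codes as if they were ordered; one-hot encoding is a common alternative when that matters.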

In [None]:
if df_original is not None:
    # Apply preprocessing
    # Note: We drop CompanyID and CompanyName as they are identifiers and not typically used as features
    df_processed = preprocess_esg_data(df_original.drop(columns=['CompanyID', 'CompanyName']))
    
    print("Dataset shape after preprocessing:", df_processed.shape)
    print("\nFirst 5 rows of preprocessed data:")
    display(df_processed.head())
    
    print("\nChecking for missing values after preprocessing:")
    print(df_processed.isnull().sum().sum()) # Should be 0
    
    print("\nData types after preprocessing:")
    df_processed.info()

## 3. Exploratory Data Visualization (on Processed Data)

In [None]:
# Visualize distributions of key processed features
if 'df_processed' in locals():
    plot_cols = ['ESG_Overall', 'ProfitMargin', 'MarketCap', 'Revenue', 'Industry', 'Region']
    plt.figure(figsize=(15, 10))
    for i, col in enumerate(plot_cols):
        plt.subplot(2, 3, i + 1)
        sns.histplot(df_processed[col], kde=True)
        plt.title(f'Processed Distribution of {col}')
    plt.tight_layout()
    plt.show()

In [None]:
# Correlation heatmap for processed numerical columns
if 'df_processed' in locals():
    plot_correlation_matrix(df_processed)

## 4. Data Preparation for Machine Learning

We'll define features (X) and targets (y) for our supervised learning tasks.

**Tasks:**
1.  **Regression:** Predict `ProfitMargin`.
2.  **Classification:** Predict whether `ESG_Overall` score is 'High' (above median) or 'Low' (below or equal to median).
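
As a small illustration of the classification setup, the sketch below builds a median-based binary label and a stratified split on synthetic data. The toy columns and random values are placeholders, not the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "ESG_Overall": rng.normal(50, 10, size=200),
    "Revenue": rng.normal(1000, 200, size=200),
})

# 1 = High (strictly above the median), 0 = Low (at or below the median).
threshold = toy["ESG_Overall"].median()
toy["ESG_Category"] = (toy["ESG_Overall"] > threshold).astype(int)

# Drop the score the label was derived from; keeping it would leak the target.
X = toy.drop(columns=["ESG_Category", "ESG_Overall"])
y = toy["ESG_Category"]

# stratify=y keeps the High/Low proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.mean(), y_test.mean())
```

Because standardization is a strictly increasing transform, thresholding at the median produces the same labels whether it is applied to the raw or the standardized scores.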

In [None]:
if 'df_processed' in locals():
    # --- Regression Task --- 
    target_reg = 'ProfitMargin'
    # Use all other columns as features, except potential identifiers if they were kept
    features_reg = df_processed.drop(columns=[target_reg]).columns.tolist()
    
    X_reg = df_processed[features_reg]
    y_reg = df_processed[target_reg]
    
    # Train-test split for regression
    X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
        X_reg, y_reg, test_size=0.2, random_state=42
    )
    print(f"Regression - Features shape: {X_reg.shape}, Target shape: {y_reg.shape}")
    print(f"Train shapes: {X_reg_train.shape}, {y_reg_train.shape}")
    print(f"Test shapes: {X_reg_test.shape}, {y_reg_test.shape}")

    # --- Classification Task --- 
    # Create a binary target based on the median ESG_Overall score.
    # Note: the split below uses the median of the *processed* (scaled) column. Assuming preprocessing
    # only rescales (and at most clips) ESG_Overall, this gives the same High/Low split as the raw scores.
    if 'ESG_Overall' in df_original.columns:
        median_esg = df_original['ESG_Overall'].median()  # Raw-scale median, kept for reference only
        df_processed['ESG_Category'] = (df_processed['ESG_Overall'] > df_processed['ESG_Overall'].median()).astype(int)
        # Alternative using the raw scores (requires df_original and df_processed to share the same index):
        # df_processed['ESG_Category'] = (df_original['ESG_Overall'] > median_esg).astype(int)
        
        target_clf = 'ESG_Category'
        # Use all columns except the target and the original ESG score it's derived from
        features_clf = df_processed.drop(columns=[target_clf, 'ESG_Overall']).columns.tolist()
        
        X_clf = df_processed[features_clf]
        y_clf = df_processed[target_clf]
        
        # Train-test split for classification
        X_clf_train, X_clf_test, y_clf_train, y_clf_test = train_test_split(
            X_clf, y_clf, test_size=0.2, random_state=42, stratify=y_clf # Stratify for classification
        )
        print(f"\nClassification - Features shape: {X_clf.shape}, Target shape: {y_clf.shape}")
        print(f"Train shapes: {X_clf_train.shape}, {y_clf_train.shape}")
        print(f"Test shapes: {X_clf_test.shape}, {y_clf_test.shape}")
        print(f"ESG Category distribution (0=Low, 1=High):\n{df_processed['ESG_Category'].value_counts(normalize=True)}")
    else:
        print("\nWarning: 'ESG_Overall' column not found in original data for classification task setup.")
        X_clf_train, X_clf_test, y_clf_train, y_clf_test = [None]*4

    # --- Data for Unsupervised Learning --- 
    # Typically use all relevant features (excluding target variables if any)
    # For clustering/PCA, we might use a subset or all processed features
    X_unsupervised = df_processed[features_reg] # Using regression features as an example
    print(f"\nUnsupervised Learning - Data shape: {X_unsupervised.shape}")

else:
    print("\nDataFrame df_processed not found. Skipping data preparation.")

## 5. Supervised Learning

We will now apply the implemented supervised learning algorithms to the prepared datasets.

### 5.1 Regression Task (Predicting ProfitMargin)

#### 5.1.1 Linear Regression
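
The two fitting methods compared in this section can be sketched in plain NumPy. The example below is an illustration on synthetic data, not the project's `LinearRegression` class: the closed-form solution solves the same least-squares problem as the normal equation, and batch gradient descent should approach it when the features are on comparable scales.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.7
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept is learned as theta[0].
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# Closed form: least-squares solution of Xb @ theta = y.
# lstsq is used instead of an explicit matrix inverse for numerical stability.
theta_normal, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Batch gradient descent on the mean squared error.
theta_gd = np.zeros(Xb.shape[1])
learning_rate, n_iterations = 0.01, 1000
for _ in range(n_iterations):
    gradient = (2.0 / len(y)) * Xb.T @ (Xb @ theta_gd - y)
    theta_gd -= learning_rate * gradient

print("normal equation: ", np.round(theta_normal, 3))
print("gradient descent:", np.round(theta_gd, 3))
```

If the gradient-descent result in the cell below differs noticeably from the normal-equation result, too few iterations or too small a learning rate are the usual suspects.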

In [None]:
if 'X_reg_train' in locals() and X_reg_train is not None:
    # Using Normal Equation (default)
    lr_normal = LinearRegression(method='normal_equation')
    lr_normal.fit(X_reg_train, y_reg_train)
    y_reg_pred_normal = lr_normal.predict(X_reg_test)
    mse_normal = mean_squared_error(y_reg_test, y_reg_pred_normal)
    r2_normal = r2_score(y_reg_test, y_reg_pred_normal)
    print("--- Linear Regression (Normal Equation) ---")
    print(f"Mean Squared Error: {mse_normal:.4f}")
    print(f"R^2 Score: {r2_normal:.4f}")

    # Using Gradient Descent
    lr_gd = LinearRegression(method='gradient_descent', learning_rate=0.01, n_iterations=1000)
    lr_gd.fit(X_reg_train.values, y_reg_train.values) # .values might be needed if fit expects numpy
    y_reg_pred_gd = lr_gd.predict(X_reg_test.values)
    mse_gd = mean_squared_error(y_reg_test, y_reg_pred_gd)
    r2_gd = r2_score(y_reg_test, y_reg_pred_gd)
    print("\n--- Linear Regression (Gradient Descent) ---")
    print(f"Mean Squared Error: {mse_gd:.4f}")
    print(f"R^2 Score: {r2_gd:.4f}")
    
    # Store results for comparison
    regression_results = {'Linear Regression (Normal Eq.)': {'MSE': mse_normal, 'R2': r2_normal},
                         'Linear Regression (GD)': {'MSE': mse_gd, 'R2': r2_gd}}
    
    # Visualization (Actual vs Predicted for Normal Equation)
    plot_regression_results(y_reg_test, y_reg_pred_normal, 'Linear Regression (Normal Eq.)')
else:
    print("Regression data not prepared. Skipping Linear Regression.")

#### 5.1.2 Decision Tree Regressor

In [None]:
if 'X_reg_train' in locals() and X_reg_train is not None:
    dt_reg = DecisionTreeRegressor(max_depth=10, min_samples_split=10, random_state=42)
    dt_reg.fit(X_reg_train, y_reg_train)
    y_reg_pred_dt = dt_reg.predict(X_reg_test)
    mse_dt = mean_squared_error(y_reg_test, y_reg_pred_dt)
    r2_dt = r2_score(y_reg_test, y_reg_pred_dt)
    print("--- Decision Tree Regressor ---")
    print(f"Mean Squared Error: {mse_dt:.4f}")
    print(f"R^2 Score: {r2_dt:.4f}")
    
    regression_results['Decision Tree Regressor'] = {'MSE': mse_dt, 'R2': r2_dt}
    plot_regression_results(y_reg_test, y_reg_pred_dt, 'Decision Tree Regressor')
else:
    print("Regression data not prepared. Skipping Decision Tree Regressor.")

#### 5.1.3 K-Nearest Neighbors (KNN) Regressor
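
The cell below uses `weights='distance'`. A minimal sketch of what distance-weighted KNN regression usually means is given here; the helper `knn_predict` and its `eps` parameter are hypothetical, and the project's `KNNRegressor` may differ in distance metric or tie handling.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, n_neighbors=7, eps=1e-12):
    """Distance-weighted KNN regression with Euclidean distance (a sketch)."""
    X_train, y_train, X_query = map(np.asarray, (X_train, y_train, X_query))
    # Pairwise distances, shape (n_query, n_train).
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the n_neighbors closest training points for each query.
    idx = np.argsort(dists, axis=1)[:, :n_neighbors]
    near_d = np.take_along_axis(dists, idx, axis=1)
    near_y = y_train[idx]
    # Inverse-distance weights; eps avoids division by zero on exact matches.
    weights = 1.0 / (near_d + eps)
    return (weights * near_y).sum(axis=1) / weights.sum(axis=1)


X_train_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train_toy = np.array([0.0, 10.0, 20.0, 30.0])
preds = knn_predict(X_train_toy, y_train_toy, np.array([[1.0], [1.5]]), n_neighbors=2)
print(preds)
```

A query that coincides with a training point is dominated by that point's target, while a query midway between two neighbors gets their plain average.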

In [None]:
if 'X_reg_train' in locals() and X_reg_train is not None:
    knn_reg = KNNRegressor(n_neighbors=7, weights='distance') 
    knn_reg.fit(X_reg_train, y_reg_train)
    y_reg_pred_knn = knn_reg.predict(X_reg_test)
    mse_knn = mean_squared_error(y_reg_test, y_reg_pred_knn)
    r2_knn = r2_score(y_reg_test, y_reg_pred_knn)
    print("--- KNN Regressor ---")
    print(f"Mean Squared Error: {mse_knn:.4f}")
    print(f"R^2 Score: {r2_knn:.4f}")

    regression_results['KNN Regressor'] = {'MSE': mse_knn, 'R2': r2_knn}
    plot_regression_results(y_reg_test, y_reg_pred_knn, 'KNN Regressor')
else:
    print("Regression data not prepared. Skipping KNN Regressor.")

#### 5.1.4 Random Forest Regressor

In [None]:
if 'X_reg_train' in locals() and X_reg_train is not None:
    rf_reg = RandomForestRegressor(n_estimators=100, max_depth=10, min_samples_split=10, random_state=42)
    rf_reg.fit(X_reg_train, y_reg_train)
    y_reg_pred_rf = rf_reg.predict(X_reg_test)
    mse_rf = mean_squared_error(y_reg_test, y_reg_pred_rf)
    r2_rf = r2_score(y_reg_test, y_reg_pred_rf)
    print("--- Random Forest Regressor ---")
    print(f"Mean Squared Error: {mse_rf:.4f}")
    print(f"R^2 Score: {r2_rf:.4f}")

    regression_results['Random Forest Regressor'] = {'MSE': mse_rf, 'R2': r2_rf}
    plot_regression_results(y_reg_test, y_reg_pred_rf, 'Random Forest Regressor')
else:
    print("Regression data not prepared. Skipping Random Forest Regressor.")

#### 5.1.5 Neural Network (Regression)

In [None]:
if 'X_reg_train' in locals() and X_reg_train is not None:
    # For regression, a network needs a single linear output unit and an MSE loss.
    # NOTE: The provided NeuralNetwork class appears geared towards classification
    # (sigmoid/softmax output), so it is not applied here. The commented-out code below
    # shows how a regression-capable variant could be evaluated.
    print("Skipping Neural Network for regression as the provided class seems classification-focused.")
    # nn_reg = NeuralNetwork(hidden_layer_size=64, activation='relu', learning_rate=0.001, n_iterations=500, batch_size=64, random_state=42)
    # # Need to adapt fit/predict/loss for regression if possible, or implement a separate RegressionNN
    # try:
    #     # Assuming adaptation for regression (e.g., linear output, MSE loss)
    #     nn_reg.fit(X_reg_train.values, y_reg_train.values.reshape(-1, 1)) # Reshape y for potential NN structure
    #     y_reg_pred_nn = nn_reg.predict(X_reg_test.values).flatten() # Flatten if predict returns column vector
    #     mse_nn = mean_squared_error(y_reg_test, y_reg_pred_nn)
    #     r2_nn = r2_score(y_reg_test, y_reg_pred_nn)
    #     print("\n--- Neural Network (Regression) ---")
    #     print(f"Mean Squared Error: {mse_nn:.4f}")
    #     print(f"R^2 Score: {r2_nn:.4f}")
    #     regression_results['Neural Network Regressor'] = {'MSE': mse_nn, 'R2': r2_nn}
    #     plot_regression_results(y_reg_test, y_reg_pred_nn, 'Neural Network Regressor')
    # except Exception as e:
    #     print(f"Could not run Neural Network for regression: {e}")
else:
    print("Regression data not prepared. Skipping Neural Network Regressor.")

#### 5.1.6 Gradient Boosting Regressor

In [None]:
if 'X_reg_train' in locals() and X_reg_train is not None:
    gb_reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
    gb_reg.fit(X_reg_train, y_reg_train)
    y_reg_pred_gb = gb_reg.predict(X_reg_test)
    mse_gb = mean_squared_error(y_reg_test, y_reg_pred_gb)
    r2_gb = r2_score(y_reg_test, y_reg_pred_gb)
    print("--- Gradient Boosting Regressor ---")
    print(f"Mean Squared Error: {mse_gb:.4f}")
    print(f"R^2 Score: {r2_gb:.4f}")

    regression_results['Gradient Boosting Regressor'] = {'MSE': mse_gb, 'R2': r2_gb}
    plot_regression_results(y_reg_test, y_reg_pred_gb, 'Gradient Boosting Regressor')
else:
    print("Regression data not prepared. Skipping Gradient Boosting Regressor.")

#### 5.1.7 Regression Model Comparison
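
Alongside the plots, the nested results dictionary can be viewed as a ranked table. The sketch below uses placeholder numbers for illustration only; they are not results from this notebook.

```python
import pandas as pd

# Placeholder numbers for illustration only; not results from this notebook.
example_results = {
    "Model A": {"MSE": 0.40, "R2": 0.60},
    "Model B": {"MSE": 0.25, "R2": 0.75},
    "Model C": {"MSE": 0.55, "R2": 0.45},
}

# Outer keys become the index, inner keys become columns.
comparison = pd.DataFrame.from_dict(example_results, orient="index")
comparison = comparison.sort_values("MSE")  # best (lowest MSE) first
print(comparison)
```

The same call should work on `regression_results` once the cells above have populated it.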

In [None]:
if 'regression_results' in locals() and regression_results:
    plot_model_comparison(regression_results, 'Regression Model Comparison (Lower MSE is Better)')
    plot_model_comparison(regression_results, 'Regression Model Comparison (Higher R2 is Better)', metric='R2', higher_is_better=True)
else:
    print("No regression results to compare.")

### 5.2 Classification Task (Predicting High/Low ESG_Overall)

#### 5.2.1 Perceptron
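
The label remapping in the cell below follows from the classic perceptron update rule, which is written in terms of labels in {-1, +1}. The sketch here shows that rule on a tiny separable dataset; `train_perceptron` is a hypothetical helper, not the project's `Perceptron` class.

```python
import numpy as np


def train_perceptron(X, y, learning_rate=0.01, n_iterations=1000):
    """Classic perceptron rule; y must contain only -1 and +1."""
    w, b = np.zeros(X.shape[1]), 0.0
    errors_per_epoch = []
    for _ in range(n_iterations):
        errors = 0
        for xi, yi in zip(X, y):
            # A point is misclassified when its signed margin is not positive.
            if yi * (xi @ w + b) <= 0:
                w += learning_rate * yi * xi
                b += learning_rate * yi
                errors += 1
        errors_per_epoch.append(errors)
        if errors == 0:  # converged: every training point is classified correctly
            break
    return w, b, errors_per_epoch


# Linearly separable toy data with 0/1 labels, remapped to -1/+1 for training.
X_toy = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y01 = np.array([1, 1, 0, 0])
y_pm = np.where(y01 == 0, -1, 1)

w, b, errors_per_epoch = train_perceptron(X_toy, y_pm)
pred_pm = np.where(X_toy @ w + b >= 0, 1, -1)
pred01 = np.where(pred_pm == -1, 0, 1)  # map back for 0/1 metrics
print(pred01, errors_per_epoch)
```

The perceptron only reaches zero training errors when the classes are linearly separable; on real data the error count per epoch typically fluctuates instead of settling at zero.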

In [None]:
# Note: Perceptron expects labels -1 and 1. Our target is 0 and 1.
# We need to map y_clf_train and y_clf_test.
if 'X_clf_train' in locals() and X_clf_train is not None:
    y_clf_train_perceptron = np.where(y_clf_train == 0, -1, 1)
    y_clf_test_perceptron = np.where(y_clf_test == 0, -1, 1)

    perceptron = Perceptron(learning_rate=0.01, n_iterations=1000, random_state=42)
    perceptron.fit(X_clf_train, y_clf_train_perceptron)
    y_clf_pred_perceptron_mapped = perceptron.predict(X_clf_test)
    
    # Map predictions back to 0/1 for standard metrics
    y_clf_pred_perceptron = np.where(y_clf_pred_perceptron_mapped == -1, 0, 1)
    
    acc_perceptron = accuracy_score(y_clf_test, y_clf_pred_perceptron)
    f1_perceptron = f1_score(y_clf_test, y_clf_pred_perceptron)
    print("--- Perceptron ---")
    print(f"Accuracy: {acc_perceptron:.4f}")
    print(f"F1 Score: {f1_perceptron:.4f}")
    print("Classification Report:\n", classification_report(y_clf_test, y_clf_pred_perceptron))

    # Store results for comparison
    classification_results = {'Perceptron': {'Accuracy': acc_perceptron, 'F1': f1_perceptron}}
    
    # Plot results (e.g., confusion matrix); the last argument passes the training error history if the model exposes one
    plot_classification_results(y_clf_test, y_clf_pred_perceptron, 'Perceptron', getattr(perceptron, 'errors', None))
else:
    print("Classification data not prepared. Skipping Perceptron.")

#### 5.2.2 Logistic Regression
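
For reference, logistic regression trained by gradient descent can be sketched in a few lines of NumPy. This is an illustration on synthetic data under the usual formulation (sigmoid output, binary cross-entropy loss); the project's `LogisticRegression` class may differ in details such as initialization or regularization.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels in {0, 1}

w, b = np.zeros(2), 0.0
learning_rate, n_iterations = 0.1, 1000
costs = []
for _ in range(n_iterations):
    p = sigmoid(X @ w + b)
    # Binary cross-entropy; clipping guards against log(0).
    p_safe = np.clip(p, 1e-12, 1 - 1e-12)
    costs.append(-np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe)))
    # Gradient of the mean cross-entropy with respect to w and b.
    w -= learning_rate * X.T @ (p - y) / len(y)
    b -= learning_rate * np.mean(p - y)

predictions = (sigmoid(X @ w + b) >= 0.5).astype(int)
accuracy = np.mean(predictions == y)
print(f"training accuracy: {accuracy:.3f}, final cost: {costs[-1]:.4f}")
```

A per-iteration cost list like `costs` is the kind of history the cell below passes to the plotting helper when the model exposes one.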

In [None]:
if 'X_clf_train' in locals() and X_clf_train is not None:
    log_reg = LogisticRegression(learning_rate=0.1, n_iterations=1000, random_state=42)
    # Logistic Regression implementation expects y in {0, 1}
    log_reg.fit(X_clf_train.values, y_clf_train.values) # Use .values if needed
    y_clf_pred_logreg = log_reg.predict(X_clf_test.values)
    
    acc_logreg = accuracy_score(y_clf_test, y_clf_pred_logreg)
    f1_logreg = f1_score(y_clf_test, y_clf_pred_logreg)
    print("--- Logistic Regression ---")
    print(f"Accuracy: {acc_logreg:.4f}")
    print(f"F1 Score: {f1_logreg:.4f}")
    print("Classification Report:\n", classification_report(y_clf_test, y_clf_pred_logreg))

    classification_results['Logistic Regression'] = {'Accuracy': acc_logreg, 'F1': f1_logreg}
    # Pass log_reg.costs if available and adaptable for plotting
    plot_classification_results(y_clf_test, y_clf_pred_logreg, 'Logistic Regression', log_reg.costs if hasattr(log_reg, 'costs') else None)
else:
    print("Classification data not prepared. Skipping Logistic Regression.")

#### 5.2.3 K-Nearest Neighbors (KNN) Classifier

In [None]:
if 'X_clf_train' in locals() and X_clf_train is not None:
    knn_clf = KNNClassifier(n_neighbors=7, weights='distance')
    knn_clf.fit(X_clf_train, y_clf_train)
    y_clf_pred_knn = knn_clf.predict(X_clf_test)
    
    acc_knn = accuracy_score(y_clf_test, y_clf_pred_knn)
    f1_knn = f1_score(y_clf_test, y_clf_pred_knn)
    print("--- KNN Classifier ---")
    print(f"Accuracy: {acc_knn:.4f}")
    print(f"F1 Score: {f1_knn:.4f}")
    print("Classification Report:\n", classification_report(y_clf_test, y_clf_pred_knn))

    classification_results['KNN Classifier'] = {'Accuracy': acc_knn, 'F1': f1_knn}
    plot_classification_results(y_clf_test, y_clf_pred_knn, 'KNN Classifier')
else:
    print("Classification data not prepared. Skipping KNN Classifier.")

#### 5.2.4 Decision Tree Classifier

In [None]:
if 'X_clf_train' in locals() and X_clf_train is not None:
    dt_clf = DecisionTreeClassifier(max_depth=10, min_samples_split=10, criterion='gini', random_state=42)
    dt_clf.fit(X_clf_train, y_clf_train)
    y_clf_pred_dt = dt_clf.predict(X_clf_test)
    
    acc_dt = accuracy_score(y_clf_test, y_clf_pred_dt)
    f1_dt = f1_score(y_clf_test, y_clf_pred_dt)
    print("--- Decision Tree Classifier ---")
    print(f"Accuracy: {acc_dt:.4f}")
    print(f"F1 Score: {f1_dt:.4f}")
    print("Classification Report:\n", classification_report(y_clf_test, y_clf_pred_dt))

    classification_results['Decision Tree Classifier'] = {'Accuracy': acc_dt, 'F1': f1_dt}
    plot_classification_results(y_clf_test, y_clf_pred_dt, 'Decision Tree Classifier')
else:
    print("Classification data not prepared. Skipping Decision Tree Classifier.")

#### 5.2.5 Random Forest Classifier

In [None]:
if 'X_clf_train' in locals() and X_clf_train is not None:
    rf_clf = RandomForestClassifier(n_estimators=100, max_depth=10, min_samples_split=10, random_state=42)
    rf_clf.fit(X_clf_train, y_clf_train)
    y_clf_pred_rf = rf_clf.predict(X_clf_test)
    
    acc_rf = accuracy_score(y_clf_test, y_clf_pred_rf)
    f1_rf = f1_score(y_clf_test, y_clf_pred_rf)
    print("--- Random Forest Classifier ---")
    print(f"Accuracy: {acc_rf:.4f}")
    print(f"F1 Score: {f1_rf:.4f}")
    print("Classification Report:\n", classification_report(y_clf_test, y_clf_pred_rf))

    classification_results['Random Forest Classifier'] = {'Accuracy': acc_rf, 'F1': f1_rf}
    plot_classification_results(y_clf_test, y_clf_pred_rf, 'Random Forest Classifier')
else:
    print("Classification data not prepared. Skipping Random Forest Classifier.")

#### 5.2.6 Neural Network (Classification)

In [None]:
if 'X_clf_train' in locals() and X_clf_train is not None:
    # Assuming binary classification (0/1 target)
    # Output layer size = 1 (for sigmoid) or 2 (for softmax/one-hot)
    # Let's assume the NN implementation handles binary with sigmoid output
    nn_clf = NeuralNetwork(hidden_layer_size=64, activation='relu', learning_rate=0.01, 
                           n_iterations=500, batch_size=64, random_state=42)
    
    # NN might expect y as a column vector
    y_train_nn = y_clf_train.values.reshape(-1, 1)
    
    nn_clf.fit(X_clf_train.values, y_train_nn)
    y_clf_pred_nn = np.ravel(nn_clf.predict(X_clf_test.values))  # predict is assumed to return class labels; flatten in case they come back as a column vector
    
    acc_nn = accuracy_score(y_clf_test, y_clf_pred_nn)
    f1_nn = f1_score(y_clf_test, y_clf_pred_nn)
    print("--- Neural Network Classifier ---")
    print(f"Accuracy: {acc_nn:.4f}")
    print(f"F1 Score: {f1_nn:.4f}")
    print("Classification Report:\n", classification_report(y_clf_test, y_clf_pred_nn))

    classification_results['Neural Network Classifier'] = {'Accuracy': acc_nn, 'F1': f1_nn}
    # Pass nn_clf.losses if available and adaptable for plotting
    plot_classification_results(y_clf_test, y_clf_pred_nn, 'Neural Network Classifier', nn_clf.losses if hasattr(nn_clf, 'losses') else None)
else:
    print("Classification data not prepared. Skipping Neural Network Classifier.")

#### 5.2.7 AdaBoost Classifier

In [None]:
if 'X_clf_train' in locals() and X_clf_train is not None:
    # AdaBoost is typically run with decision stumps as weak learners.
    # Use the project's DecisionTreeClassifier with max_depth=1 as the stump.
    stump = DecisionTreeClassifier(max_depth=1)
    ada = AdaBoostClassifier(base_estimator=stump, n_estimators=50, learning_rate=1.0, random_state=42)
    
    # AdaBoost implementation might expect y in {0, 1} or {-1, 1}. Assuming {0, 1} based on code.
    ada.fit(X_clf_train, y_clf_train)
    y_clf_pred_ada = ada.predict(X_clf_test)
    
    acc_ada = accuracy_score(y_clf_test, y_clf_pred_ada)
    f1_ada = f1_score(y_clf_test, y_clf_pred_ada)
    print("--- AdaBoost Classifier ---")
    print(f"Accuracy: {acc_ada:.4f}")
    print(f"F1 Score: {f1_ada:.4f}")
    print("Classification Report:\n", classification_report(y_clf_test, y_clf_pred_ada))

    classification_results['AdaBoost Classifier'] = {'Accuracy': acc_ada, 'F1': f1_ada}
    plot_classification_results(y_clf_test, y_clf_pred_ada, 'AdaBoost Classifier')
else:
    print("Classification data not prepared. Skipping AdaBoost Classifier.")

#### 5.2.8 Classification Model Comparison

In [None]:
if 'classification_results' in locals() and classification_results:
    plot_model_comparison(classification_results, 'Classification Model Comparison (Higher Accuracy is Better)', metric='Accuracy', higher_is_better=True)
    plot_model_comparison(classification_results, 'Classification Model Comparison (Higher F1 is Better)', metric='F1', higher_is_better=True)
else:
    print("No classification results to compare.")

## 6. Unsupervised Learning

We'll apply unsupervised algorithms for dimensionality reduction and clustering using the processed features (`X_unsupervised`).

### 6.1 Principal Component Analysis (PCA)
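
Selecting the number of components from a variance threshold, as `n_components=0.95` is meant to do below, can be sketched with NumPy's SVD. The synthetic data and variable names are illustrative, and the project's `PCA` class may compute this differently (for example via the covariance matrix).

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: 5 observed features driven mostly by 2 latent factors.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

# Center the data, then take the SVD; squared singular values are
# proportional to the variance captured by each principal component.
X_centered = X - X.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest k whose cumulative explained variance reaches the threshold.
k = int(np.searchsorted(cumulative, 0.95) + 1)
X_projected = X_centered @ Vt[:k].T
print(k, np.round(cumulative, 3), X_projected.shape)
```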

In [None]:
if 'X_unsupervised' in locals():
    # Fit PCA to retain 95% of variance
    pca = PCA(n_components=0.95)
    X_pca = pca.fit_transform(X_unsupervised)
    
    print("--- Principal Component Analysis (PCA) ---")
    print(f"Original number of features: {X_unsupervised.shape[1]}")
    print(f"Number of components selected by PCA (to retain 95% variance): {pca.n_components_}")
    print(f"Shape of transformed data: {X_pca.shape}")
    print(f"Explained variance ratio per component: \n{pca.explained_variance_ratio_}")
    print(f"Cumulative explained variance: \n{np.cumsum(pca.explained_variance_ratio_)}")
    
    # Plot explained variance
    plot_pca_results(pca)
    
    # Visualize the first two principal components, colored by Industry when available
    plt.figure(figsize=(10, 7))
    # Use the original (unencoded) Industry labels for coloring.
    # This assumes preprocessing kept every row in its original order.
    if 'Industry' in df_original.columns:
        unique_industries = df_original['Industry'].unique()
        # plt.cm.get_cmap was removed in Matplotlib 3.9; plt.get_cmap accepts the same arguments
        colors = plt.get_cmap('viridis', len(unique_industries))
        industry_map = {industry: i for i, industry in enumerate(unique_industries)}
        scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df_original['Industry'].map(industry_map), cmap=colors, alpha=0.6)
        plt.colorbar(scatter, label='Industry (Encoded)')
    else:
        plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.6)

    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.title('PCA: Data projected onto first two components')
    plt.grid(True)
    plt.show()

else:
    print("Unsupervised data not prepared. Skipping PCA.")

### 6.2 K-Means Clustering
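
The cell below fixes the number of clusters at an example value. One common way to choose it is to scan several values and compare silhouette scores, sketched here on synthetic blobs. Note that this sketch uses scikit-learn's `KMeans`, not the project's implementation, purely to keep the example self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Three well-separated blobs in 2D as stand-in data.
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print({k: round(s, 3) for k, s in scores.items()}, "best k:", best_k)
```

On real data the silhouette curve is often much flatter than on clean blobs, so it is worth reading alongside the inertia (elbow) curve rather than in isolation.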

In [None]:
if 'X_pca' in locals(): # Use PCA-transformed data for easier visualization/clustering
    n_clusters_kmeans = 5 # Example number of clusters
    kmeans = KMeans(n_clusters=n_clusters_kmeans, random_state=42)
    kmeans.fit(X_pca) # Fit on PCA data (first few components)
    kmeans_labels = kmeans.labels_
    kmeans_silhouette = silhouette_score(X_pca, kmeans_labels)

    print("--- K-Means Clustering ---")
    print(f"Number of clusters: {n_clusters_kmeans}")
    print(f"Inertia (Sum of squared distances): {kmeans.inertia_:.2f}")
    print(f"Silhouette Score: {kmeans_silhouette:.4f}")
    
    # Plot K-Means clusters using the first two PCA components
    plot_clusters(X_pca, kmeans_labels, kmeans.centroids_, 'K-Means Clustering Results (on PCA data)')
else:
    print("PCA data not available. Skipping K-Means.")

### 6.3 DBSCAN Clustering
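
DBSCAN's `eps` usually needs tuning. A widely used heuristic is the k-distance plot: compute each point's distance to its k-th nearest neighbor, sort those distances, and look for the knee. The sketch below computes the curve with NumPy on synthetic data; it assumes `min_samples` counts the point itself (scikit-learn's convention, which the project's `DBSCAN` may or may not share), and the percentile is only a crude stand-in for reading the knee by eye.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(100, 2)),
])

min_samples = 5
# Full pairwise Euclidean distance matrix (fine for small data; O(n^2) memory).
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
# After sorting each row, column 0 is the point itself (distance 0), so column
# min_samples - 1 is the radius within which the point has min_samples points,
# counting itself.
k_distances = np.sort(dists, axis=1)[:, min_samples - 1]
sorted_k_distances = np.sort(k_distances)

# Heuristic: plot sorted_k_distances and pick eps near the knee of the curve.
# A high percentile is used here only as a rough automatic substitute.
eps_candidate = float(np.percentile(sorted_k_distances, 90))
print(f"candidate eps: {eps_candidate:.3f}")
```

For larger datasets a nearest-neighbor index is preferable to the full distance matrix used here.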

In [None]:
if 'X_pca' in locals():
    # DBSCAN parameters often require tuning; the values below are examples.
    # This uses the project's DBSCAN implementation (SklearnDBSCAN is imported above for comparison only).
    dbscan = DBSCAN(eps=0.5, min_samples=5) 
    dbscan.fit(X_pca) # Fit on PCA data
    dbscan_labels = dbscan.labels_
    n_clusters_dbscan = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
    n_noise = list(dbscan_labels).count(-1)
    
    print("--- DBSCAN Clustering ---")
    print(f"Estimated number of clusters: {n_clusters_dbscan}")
    print(f"Estimated number of noise points: {n_noise}")
    
    # Silhouette needs at least 2 clusters. Noise points (label -1) are excluded so that
    # they are not scored as if they formed a cluster of their own.
    if n_clusters_dbscan > 1:
        non_noise = np.asarray(dbscan_labels) != -1
        dbscan_silhouette = silhouette_score(np.asarray(X_pca)[non_noise], np.asarray(dbscan_labels)[non_noise])
        print(f"Silhouette Score (noise excluded): {dbscan_silhouette:.4f}")
    else:
        dbscan_silhouette = np.nan  # Not computable with fewer than 2 clusters
        print("Silhouette Score cannot be calculated (fewer than 2 clusters found).")
        
    # Plot DBSCAN clusters
    plot_clusters(X_pca, dbscan_labels, None, 'DBSCAN Clustering Results (on PCA data)') # No centroids for DBSCAN

else:
    print("PCA data not available. Skipping DBSCAN.")

### 6.4 Clustering Model Comparison

In [None]:
# Compare K-Means and DBSCAN using Silhouette Score
if 'kmeans_silhouette' in locals() and 'dbscan_silhouette' in locals():
    clustering_comparison = {
        'K-Means': {'Silhouette': kmeans_silhouette},
        'DBSCAN': {'Silhouette': dbscan_silhouette if n_clusters_dbscan > 1 else np.nan} # Use NaN if score wasn't calculated
    }
    plot_model_comparison(clustering_comparison, 'Clustering Model Comparison (Higher Silhouette is Better)', 
                          metric='Silhouette', higher_is_better=True)
else:
    print("Clustering results not available for comparison.")
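
For reference, the silhouette coefficient of a single point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is its mean distance to the points of the nearest other cluster. The score compared above is the mean over all points. The following sketch computes it from that definition with NumPy. The helper name and the synthetic blobs are assumptions for illustration, and a library routine such as `silhouette_score` should be preferred in practice.

```python
import numpy as np

def silhouette_from_definition(X, labels):
    """Mean silhouette coefficient computed directly from its definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    unique = np.unique(labels)
    if len(unique) < 2:
        return np.nan  # undefined with a single cluster
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # singleton cluster: score stays 0 by convention
        a = dist[i, same].sum() / (n_same - 1)  # self-distance is 0, so divide by n_same - 1
        b = min(dist[i, labels == c].mean() for c in unique if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())

rng = np.random.default_rng(0)
# Synthetic data: two tight, well-separated blobs
X_blobs = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                     rng.normal(5.0, 0.1, size=(20, 2))])
true_labels = np.array([0] * 20 + [1] * 20)
shuffled_labels = rng.permutation(true_labels)

score_good = silhouette_from_definition(X_blobs, true_labels)
score_shuffled = silhouette_from_definition(X_blobs, shuffled_labels)
print(f"Correct labels: {score_good:.3f}, shuffled labels: {score_shuffled:.3f}")
```

Scores near 1 indicate compact, well-separated clusters; scores near 0 or below indicate overlapping or misassigned points.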

### 6.5 Singular Value Decomposition (SVD) for Compression (Demonstration)

Because the ESG dataset is tabular rather than image data, we demonstrate SVD on a small matrix taken from it: the first 100 samples and the first 10 features. Its primary use case here, image compression, would require image data.

In [None]:
if 'X_unsupervised' in locals():
    print("--- Singular Value Decomposition (SVD) Demonstration ---")
    # Example: Apply SVD to a subset of the data (e.g., first 100 samples, first 10 features)
    X_subset = X_unsupervised.iloc[:100, :10].values
    print(f"Applying SVD to a subset matrix of shape: {X_subset.shape}")
    
    svd_comp = SVDCompression(n_components=None) # Keep all components initially
    svd_comp.fit(X_subset) # Fit SVD
    
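    # Plot the cumulative explained variance ratio (x-axis starts at 1 component)
    plt.figure(figsize=(10, 6))
    cumulative_variance = np.cumsum(svd_comp.explained_variance_ratio_)
    plt.plot(np.arange(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')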
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance Ratio')
    plt.title('SVD: Explained Variance by Number of Components')
    plt.grid(True)
    plt.show()

    # Reconstruct using a smaller number of components (e.g., 5)
    n_recon_components = 5
    X_reconstructed = svd_comp.transform(n_components=n_recon_components)
    reconstruction_error = np.linalg.norm(X_subset - X_reconstructed) / np.linalg.norm(X_subset)
    print(f"\nReconstructing using {n_recon_components} components.")
    print(f"Shape of reconstructed data: {X_reconstructed.shape}")
    print(f"Relative Reconstruction Error: {reconstruction_error:.4f}")
    
    # Note: Visualizing the reconstruction isn't meaningful here as it's not image data.
    # For actual image compression, you would load an image, fit SVD, 
    # transform with fewer components, and display the reconstructed image.

else:
    print("Unsupervised data not available. Skipping SVD demonstration.")
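
The steps in the demonstration cell can also be reproduced with `np.linalg.svd`. The sketch below uses a random matrix as a stand-in for `X_subset` and treats each component's share of the squared singular values as its explained variance ratio. That is one common definition and may differ from what `SVDCompression` computes internally.

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(size=(100, 10))  # hypothetical stand-in for X_subset

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def truncated_reconstruction(U, s, Vt, k):
    """Rank-k approximation built from the k largest singular components."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

k = 5
A_k = truncated_reconstruction(U, s, Vt, k)

# Relative reconstruction error in the Frobenius norm
rel_error = np.linalg.norm(A - A_k) / np.linalg.norm(A)

# The same error follows from the discarded singular values alone
expected_error = np.sqrt(np.sum(s[k:] ** 2) / np.sum(s ** 2))

# Share of squared singular values captured by the first 1, 2, ... components
cumulative_energy = np.cumsum(s ** 2) / np.sum(s ** 2)

print(f"Relative error with {k} components: {rel_error:.4f}")
print(f"Cumulative share captured by {k} components: {cumulative_energy[k - 1]:.4f}")
```

The reconstruction keeps the original shape; only its rank drops to `k`. The storage saving comes from keeping `U[:, :k]`, `s[:k]`, and `Vt[:k, :]` instead of the full matrix.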

## 7. Conclusion

This notebook applied a range of supervised and unsupervised learning algorithms to the ESG and Financial Performance dataset.

**Key Steps:**
* Data was loaded, explored, and preprocessed (handling missing values, encoding, scaling).
* **Supervised Learning:**
    * Regression models (Linear Regression, Decision Tree, KNN, Random Forest, Neural Network, Gradient Boosting) were trained to predict `ProfitMargin`. Performance varied; the ensemble methods (Random Forest, Gradient Boosting) often achieved higher R² scores than the simpler models, indicating a better fit to the data, though they can overfit if left untuned.
    * Classification models (Perceptron, Logistic Regression, KNN, Decision Tree, Random Forest, Neural Network, AdaBoost) were trained to predict `ESG_Category`, a label derived from high versus low `ESG_Overall`. Again, ensemble methods and KNN generally performed well on Accuracy and F1-score, suggesting that non-linear relationships influence ESG standing. The Perceptron, a linear model, likely struggled because the classes are not linearly separable.
* **Unsupervised Learning:**
    * PCA successfully reduced the dimensionality while retaining a significant portion of the variance, enabling 2D visualization.
    * K-Means and DBSCAN were applied to the PCA-reduced data to identify potential clusters of companies. K-Means assigns every point to one of k clusters, whereas DBSCAN finds density-based clusters and labels outlying points as noise, which may be more realistic for this type of data. Silhouette scores give a quantitative comparison, but visual inspection of the clusters remains important.
    * SVD was demonstrated on a subset matrix, showing how much variance each singular component captures and illustrating its potential for dimensionality reduction or (in its primary application) image compression.

**Overall:** The analysis showcased the implementation and comparative performance of fundamental machine learning algorithms. The results suggest that ESG scores and financial metrics have complex interdependencies that are better captured by non-linear or ensemble models. Further work could involve more rigorous hyperparameter tuning, feature engineering, and exploring different classification/regression targets within the dataset.