
# Machine Learning Regression Analysis
This notebook evaluates several linear regression models for predicting a target variable (`T-MoCA`), using features ranked by permutation importance. It includes:
- Data preprocessing
- Feature selection based on permutation importance
- Cross-validation to evaluate performance (R-squared and RMSE)
- Visualization of the results


In [None]:

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, BayesianRidge
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import RobustScaler
from sklearn.inspection import permutation_importance
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score



## Load and Merge Data
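Two datasets are loaded: `digimoca_results_totales.csv` and `datos_participantes_totales.csv`. They are merged on the shared `id` column to create a unified dataset for analysis. Because `pd.merge` performs an inner join by default, only ids present in both files are kept.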
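
To make the join semantics concrete, here is a minimal sketch with invented rows. The `age` and `score` values and the `*_demo` names are hypothetical and do not come from the real CSV files. It also shows one possible way to check which ids an inner join would discard.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (all values are invented).
participants_demo = pd.DataFrame({"id": [1, 2, 3], "age": [70, 65, 80]})
results_demo = pd.DataFrame({"id": [1, 1, 3, 4], "score": [0.5, 0.7, 0.2, 0.9]})

# Inner join on 'id': ids 2 and 4 have no match and are dropped.
merged_demo = pd.merge(participants_demo, results_demo, on="id")

# An outer join with indicator=True reveals the rows the inner join loses.
audit = pd.merge(participants_demo, results_demo, on="id", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(merged_demo)
print(unmatched[["id", "_merge"]])
```

Note that a participant with several result rows (id 1 above) appears several times after the merge, which is why grouping by `id` matters during cross-validation.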


In [None]:

# Load datasets
dataset = pd.read_csv("digimoca_results_totales.csv")
participants = pd.read_csv("datos_participantes_totales.csv")

# Merge datasets
merged = pd.merge(participants, dataset, on='id')

# Display the first few rows of the merged dataset
merged.head()



## Data Preprocessing
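- **Feature Selection**: Drop the columns that are not used as predictors: `id`, `timestamp`, the target `T-MoCA`, and `GDS`, `gender`, `age`, `studies`, `lawton`, `MFE`.
- **Feature Scaling**: Apply `RobustScaler`, which centers each feature on its median and scales it by its interquartile range, so outliers have less influence than with mean/variance scaling.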
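
As an illustration of what robust scaling does, the sketch below compares `RobustScaler` with a manual median/IQR computation on a single invented feature that contains an outlier. The array `x` is made up for this example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with an obvious outlier (invented values).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual equivalent: subtract the median, divide by the interquartile range.
median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q75 - q25)

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

The outlier still maps to a large value, but it does not distort the scale of the other samples, because the median and IQR are barely affected by it.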


In [None]:

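# Define features and target variable
ids = merged['id']  # kept aside: used later as cross-validation groups
# Drop the identifier, the non-predictor columns and the target
X = merged.drop(columns=['GDS', 'id', 'gender', 'age', 'studies', 'lawton', 'MFE', 'T-MoCA', 'timestamp'])
y = merged['T-MoCA'].to_numpy()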

# Apply robust scaling
scaler = RobustScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Display scaled features
X.head()



## Model Selection and Evaluation
Four linear models are evaluated here; other models can be added to the `estimators` list.
To add one, create an instance of the scikit-learn estimator you want to use and give it a short name to identify it.
Evaluation is performed using:
- Permutation importance to determine feature relevance.
- Leave-One-Group-Out (LOGO) cross-validation, grouped by participant `id`, to assess model performance.
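
The following sketch shows how Leave-One-Group-Out splits behave on a tiny invented dataset. The group labels and `X_demo` are hypothetical. Each fold holds out all rows of exactly one group, so rows from the same participant never appear in both the training and the test set.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Six samples from three hypothetical participants (two rows each).
groups_demo = np.array([10, 10, 20, 20, 30, 30])
X_demo = np.arange(12).reshape(6, 2)

logo = LeaveOneGroupOut()
splits = list(logo.split(X_demo, groups=groups_demo))

for train_idx, test_idx in splits:
    # Each test fold contains the rows of a single participant.
    print("held out:", np.unique(groups_demo[test_idx]), "train rows:", train_idx)
```

When every `id` occurs only once, this reduces to ordinary leave-one-out cross-validation.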


In [None]:

# Define models
estimators = [
    {'model': LinearRegression(), 'name': 'lr'},
    {'model': Ridge(), 'name': 'ri'},
    {'model': BayesianRidge(), 'name': 'bar'},
    {'model': PLSRegression(n_components=1), 'name': 'plsr'}
]



## Evaluate Models and Feature Importance
For each model:
1. Fit the model on all samples.
2. Compute permutation importance to rank features.
3. Calculate the cross-validated R-squared using the top-k features, for every k from 1 to the total number of features.
4. Plot feature importance and R-squared.
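
The steps above can be sketched on synthetic data where the informative features are known in advance. Everything here is an assumption made for illustration: the feature names `f0` to `f3`, the coefficients, the group layout, and the use of `Ridge` as the only model.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

rng = np.random.default_rng(0)
n = 40
X_toy = pd.DataFrame(rng.normal(size=(n, 4)), columns=["f0", "f1", "f2", "f3"])
# Only f0 and f1 drive the target; f2 and f3 are pure noise.
y_toy = 3.0 * X_toy["f0"] + 1.0 * X_toy["f1"] + rng.normal(scale=0.1, size=n)
groups_toy = np.repeat(np.arange(10), 4)  # 10 hypothetical participants, 4 rows each

# Steps 1-2: fit on all samples, then rank features by mean permutation importance.
model = Ridge().fit(X_toy, y_toy)
result = permutation_importance(model, X_toy, y_toy, scoring="r2", n_repeats=20, random_state=0)
ranking = pd.Series(result.importances_mean, index=X_toy.columns).sort_values(ascending=False)

# Step 3: cross-validated R2 for the top-k features, k = 1..4.
r2_by_k = []
for k in range(1, len(ranking) + 1):
    top_k = ranking.index[:k]
    pred = cross_val_predict(Ridge(), X_toy[top_k], y_toy, cv=LeaveOneGroupOut(), groups=groups_toy)
    r2_by_k.append(r2_score(y_toy, pred))

best_k = int(np.argmax(r2_by_k)) + 1
print(ranking.index.tolist(), best_k)
```

In this toy setup the ranking is computed on the full dataset, as in the notebook, while only the R-squared values are cross-validated.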


In [None]:

# Function to plot feature importances and R2 scores
def plot_feature_importances(fis, r2_list, name):
    # fis: importances sorted in descending order
    # r2_list[k - 1]: cross-validated R2 obtained with the top-k features
    positions = range(1, len(r2_list) + 1)

    fig, ax1 = plt.subplots()
    color = 'tab:blue'
    ax1.set_title(name)  # identify the model in the figure
    ax1.set_xlabel('Features (ranked by importance)')
    ax1.set_ylabel('AVG Importances', color=color)
    ax1.bar(positions, fis, color=color)
    ax1.tick_params(axis='y', labelcolor=color)

    # Second y-axis for the R2 curve, sharing the same x-axis
    ax2 = ax1.twinx()
    color = 'tab:red'
    ax2.set_ylabel('R2 score', color=color)
    ax2.plot(positions, r2_list, color=color)
    ax2.tick_params(axis='y', labelcolor=color)

    fig.tight_layout()
    plt.show()

# Evaluate models
for model in estimators:
    # Fit on all samples, then rank features by mean permutation importance
    model['model'].fit(X, y)
    result = permutation_importance(model['model'], X, y, scoring='r2', n_repeats=100, n_jobs=-1)
    avg_fi = pd.Series(result.importances_mean, index=X.columns)

    # Cross-validated R2 with the top-i features, holding out one participant id per fold
    r2_list = []
    for i in range(1, X.shape[1] + 1):
        X_best = X[avg_fi.nlargest(i).index]
        y_pred = cross_val_predict(model['model'], X_best, y, groups=ids, cv=LeaveOneGroupOut(), n_jobs=-1)
        r2_list.append(r2_score(y, np.ravel(y_pred)))

    # Keep the results per model so later cells do not depend on the last loop iteration
    model['avg_fi'] = avg_fi
    model['r2_list'] = r2_list

    # Display R2 scores
    best_n = int(np.argmax(r2_list)) + 1
    print(f"{model['name']}: Optimal features: {best_n}, Max R2: {max(r2_list)}")
    plot_feature_importances(avg_fi.sort_values(ascending=False).values, r2_list, model['name'])



## Scatter Plot of Predictions
Generate scatter plots for predicted vs true values for all models.

**IMPORTANT: `scatter_plot()` draws a fixed 2x2 grid, so include exactly 4 models in the `estimators` list for this section.** Extra models are silently skipped and missing ones leave empty panels (or modify `scatter_plot()` to use a different grid).
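
One possible way to lift the 4-model restriction is to derive the grid size from the number of models. The helper below is a hypothetical sketch, not part of the notebook: `grid_shape` and its `max_cols` parameter are names introduced here for illustration.

```python
import math

def grid_shape(n_plots, max_cols=2):
    """Return (rows, cols) for a subplot grid that can hold n_plots panels."""
    if n_plots < 1:
        raise ValueError("n_plots must be at least 1")
    cols = min(max_cols, n_plots)
    rows = math.ceil(n_plots / cols)
    return rows, cols

print(grid_shape(4))
print(grid_shape(5))
```

Inside `scatter_plot()`, the fixed `plt.subplots(2, 2, ...)` call could then become `plt.subplots(*grid_shape(len(estimators)), squeeze=False, ...)`, with any unused axes hidden through `ax.set_visible(False)`.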

In [None]:

# Function to plot scatter plots
def scatter_plot(y, y_pred_list, estimators):
    # Fixed 2x2 grid: one panel per model
    fig, axs = plt.subplots(2, 2, figsize=(9, 7))
    axs = np.ravel(axs)
    for ax, est, y_pred in zip(axs, estimators, y_pred_list):
        name = type(est['model']).__name__
        r2 = r2_score(y, y_pred)
        rmse = np.sqrt(mean_squared_error(y, y_pred))
        # Dashed diagonal marks perfect predictions
        ax.plot([y.min(), y.max()], [y.min(), y.max()], "--r", linewidth=2)
        ax.scatter(y, y_pred, alpha=0.2)
        ax.set_xlabel("True T-MoCA")
        ax.set_ylabel("Predicted T-MoCA")
        ax.set_title(f"{name} (R2: {r2:.2f}, RMSE: {rmse:.2f})")
    plt.tight_layout()
    plt.show()

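# Generate predictions and scatter plots
y_pred_list = []
for model in estimators:
    # Use this model's own importances and R2 curve when they are stored on its entry;
    # otherwise fall back to avg_fi / r2_list, which belong to the last evaluated model.
    fi = model.get('avg_fi', avg_fi)
    scores = model.get('r2_list', r2_list)
    best_features = fi.nlargest(int(np.argmax(scores)) + 1).index
    X_best = X[best_features]
    y_pred = cross_val_predict(model['model'], X_best, y, groups=ids, cv=LeaveOneGroupOut(), n_jobs=-1)
    y_pred_list.append(np.ravel(y_pred))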

scatter_plot(y, y_pred_list, estimators)
