# Random Forest regression on `synthetic_faults_dataset.csv`

This notebook trains Random Forest regressors to predict four physical source parameters
(`phys_dip`, `phys_length`, `phys_slip`, `phys_opening`) from three feature sets:

- displacement + LOS features
- displacement-only features
- LOS-only features

For each combination (feature set, target parameter), we compute R², RMSE and MAE
and collect everything in a summary table.
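
For reference, the three metrics can be written out in a few lines of NumPy. This is a minimal sketch of the standard definitions only: the function name `regression_metrics` is hypothetical and is not part of `random_forest.py`, whose own implementation may differ (for example, it may call scikit-learn).

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for two equal-length 1-D arrays (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

    return {
        'r2': float(1.0 - ss_res / ss_tot),   # undefined if y_true is constant
        'rmse': float(np.sqrt(np.mean(residuals ** 2))),
        'mae': float(np.mean(np.abs(residuals))),
    }

# Toy values, chosen only to exercise the formulas
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

R² is unitless and compares the model against always predicting the mean, while RMSE and MAE are in the units of the target, so they are not directly comparable across `phys_dip`, `phys_length`, `phys_slip` and `phys_opening`.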

In [None]:
from dataset_generator import dataset_generator
dataset_generator()

In [None]:
import pandas as pd

from random_forest import run_random_forest_regression

# Load dataset generated by `dataset_generator.py`
df = pd.read_csv('data/synthetic_faults_dataset.csv')
df.head()

## Targets and feature sets

We will try to recover the following physical parameters from the synthetic data:

- `phys_dip`
- `phys_length`
- `phys_slip`
- `phys_opening`

We define three feature sets:

1. **disp+LOS** – all non-physical features (displacement + LOS)
2. **disp_only** – displacement-related summary features
3. **los_only** – simple LOS statistics


In [None]:
# Physical parameters to regress
target_params = [
    'phys_dip',
    'phys_length',
    'phys_slip',
    'phys_opening',
]

# 1) displacement + LOS: all non-physical columns except the label
feature_cols_all = [
    c for c in df.columns
    if not c.startswith('phys_') and c != 'label'
]

# 2) displacement-only features
disp_feature_cols = [
    'Ux_max', 'Uy_max', 'Uz_max',
    'Ux_min', 'Uy_min', 'Uz_min',
    'Ux_range', 'Uy_range', 'Uz_range',
    'Ux_Uz_ratio', 'Uy_Uz_ratio',
    'Uz_energy',
]

# 3) LOS-only features
los_feature_cols = ['LOS_max', 'LOS_std']

feature_sets = {
    'disp+LOS': feature_cols_all,
    'disp_only': disp_feature_cols,
    'los_only': los_feature_cols,
}

feature_sets
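
The displacement and LOS lists above are written by hand, so a misspelled or missing column name would only surface later as a `KeyError` when the features are selected. A small check like the following can catch that early. It is a sketch on toy column names; `missing_columns` is a hypothetical helper, not part of the project code.

```python
def missing_columns(available, feature_sets):
    """Map each feature-set name to the requested columns absent from `available`."""
    available = set(available)
    return {
        name: [c for c in cols if c not in available]
        for name, cols in feature_sets.items()
    }

# Toy column names standing in for df.columns (illustrative values only)
toy_columns = ['Ux_max', 'Uz_max', 'LOS_max', 'LOS_std', 'phys_dip', 'label']
toy_sets = {
    'disp_only': ['Ux_max', 'Uy_max', 'Uz_max'],
    'los_only': ['LOS_max', 'LOS_std'],
}
report = missing_columns(toy_columns, toy_sets)
```

In this notebook the equivalent call would be `missing_columns(df.columns, feature_sets)`; an empty list for every feature set means all requested columns exist.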

## Helper: run regressions for all targets and feature sets

We now define a small helper that loops over all feature sets and all target
parameters, calls `run_random_forest_regression`, and collects the metrics
in a single summary table.
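
The implementation of `run_random_forest_regression` lives in `random_forest.py` and is not shown in this notebook. Judging from how it is called here, it returns a fitted model together with a dict-like result containing at least `r2`, `rmse` and `mae`. The sketch below shows one way such a function could look using scikit-learn. The name `rf_regression_sketch`, the 80/20 split, the number of trees and the random seed are all assumptions made for illustration, not the project's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def rf_regression_sketch(X, y, test_size=0.2, n_estimators=100, random_state=0):
    """Hypothetical stand-in: fit a Random Forest and score it on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    rf = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    # Metrics are computed on the held-out split only
    res = {
        'r2': float(r2_score(y_test, y_pred)),
        'rmse': float(np.sqrt(mean_squared_error(y_test, y_pred))),
        'mae': float(mean_absolute_error(y_test, y_pred)),
    }
    return rf, res

# Toy data: the target depends linearly on the first feature plus small noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] + 0.1 * rng.normal(size=200)
rf_toy, res_toy = rf_regression_sketch(X_toy, y_toy)
```

Scoring on a held-out split rather than on the training rows matters for Random Forests, which can fit their training data almost perfectly and would otherwise report overly optimistic R² values.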

In [None]:
import pandas as pd

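def run_all_regressions(df, feature_sets, target_params):
    """Fit one Random Forest per (feature set, target) pair.

    Returns a summary DataFrame with one row per pair, and a nested dict
    `results_by_feature_set[fs_name][param]` holding the `res` object
    returned for that pair.
    """
    summary_rows = []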
    results_by_feature_set = {}

    for fs_name, feat_cols in feature_sets.items():
        print("\n==============================")
        print(f"Feature set: {fs_name}")
        print(f"Using features: {feat_cols}\n")

        X = df[feat_cols]
        results_by_feature_set[fs_name] = {}

        for param in target_params:
            y = df[param]

            rf, res = run_random_forest_regression(
                X, y,
                target_name=param,
                experiment_name=f"RF regression – {fs_name}",
                plot_scatter=True,       # set to False to skip the scatter plots
                plot_example_tree=True,  # set to False to skip the example tree plot
            )

            results_by_feature_set[fs_name][param] = res

            summary_rows.append({
                'feature_set': fs_name,
                'target': param,
                'r2': res['r2'],
                'rmse': res['rmse'],
                'mae': res['mae'],
            })

    summary_df = pd.DataFrame(summary_rows)
    return summary_df, results_by_feature_set

## Run all regressions

We now run all 12 experiments (3 feature sets × 4 target parameters)
with a single call to the helper function.

In [None]:
summary_df, results_by_feature_set = run_all_regressions(
    df, feature_sets, target_params
)
summary_df.sort_values(['target', 'feature_set'])

## Comparison tables

To better compare feature sets, we can pivot the summary table and look at R²
for each target and feature set.
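
To make the reshaping concrete, here is a toy long-format table and the wide table that `DataFrame.pivot` produces from it. The numbers are placeholders, not results from this dataset.

```python
import pandas as pd

# Long format: one row per (feature_set, target) pair, as in summary_df
toy_summary = pd.DataFrame({
    'feature_set': ['disp+LOS', 'los_only', 'disp+LOS', 'los_only'],
    'target': ['phys_dip', 'phys_dip', 'phys_slip', 'phys_slip'],
    'r2': [0.9, 0.4, 0.8, 0.3],   # placeholder values
})

# Wide format: rows are targets, columns are feature sets, cells hold R²
toy_pivot = toy_summary.pivot(index='target', columns='feature_set', values='r2')
```

`pivot` requires each (target, feature set) pair to appear exactly once and raises a `ValueError` on duplicates, which holds here because the helper appends one row per pair.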

In [None]:
# Pivot on R²
pivot_r2 = summary_df.pivot(index='target', columns='feature_set', values='r2')
pivot_r2

In [None]:
# If you also want to inspect RMSE or MAE, you can pivot them as well
pivot_rmse = summary_df.pivot(index='target', columns='feature_set', values='rmse')
pivot_mae = summary_df.pivot(index='target', columns='feature_set', values='mae')

pivot_rmse, pivot_mae