# Random Forest Demo

This notebook demonstrates the `rand_forest()` model specification in py-parsnip.

Random forests support both regression and classification modes and provide comprehensive outputs including:
- Feature importances (instead of coefficients)
- Performance metrics by split (train/test)
- Observation-level predictions and residuals

In [1]:
import pandas as pd
import numpy as np
from py_parsnip import rand_forest

# Set random seed for reproducibility
np.random.seed(42)

## Example 1: Regression with Random Forest

We'll predict house prices based on size, number of bedrooms, and age.

In [2]:
# Generate synthetic housing data
n = 100
data = pd.DataFrame({
    'size_sqft': np.random.uniform(800, 3000, n),
    'bedrooms': np.random.randint(1, 6, n),
    'age_years': np.random.uniform(0, 50, n),
})

# Create price with some non-linear relationships
data['price'] = (
    150 * data['size_sqft'] +
    20000 * data['bedrooms'] -
    500 * data['age_years'] +
    0.05 * data['size_sqft'] ** 2 +  # Non-linear term
    np.random.normal(0, 20000, n)
)

print(data.head())
print(f"\nData shape: {data.shape}")

     size_sqft  bedrooms  age_years          price
0  1623.988261         1  36.760806  345873.588459
1  2891.571474         4  10.453581  966575.171871
2  2410.386672         5  27.072399  714517.362122
3  2117.048665         4  34.789220  628285.051234
4  1143.241009         5  11.427501  334119.075358

Data shape: (100, 4)
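To see how much of a given price comes from the formula and how much from the added noise, the noise-free part can be evaluated for a single row. The sketch below uses the first row printed above; the helper name `expected_price` is introduced here for illustration and is not part of the notebook.

```python
def expected_price(size_sqft: float, bedrooms: int, age_years: float) -> float:
    """Noise-free part of the synthetic price formula used in this notebook."""
    return (
        150 * size_sqft
        + 20000 * bedrooms
        - 500 * age_years
        + 0.05 * size_sqft ** 2  # non-linear term
    )

# Values copied from row 0 of the printed data
signal = expected_price(1623.988261, 1, 36.760806)
noise = 345873.588459 - signal  # observed price minus noise-free price

print(f"Noise-free price: {signal:,.2f}")
print(f"Implied noise:    {noise:,.2f}")
```

For this row the quadratic term contributes roughly a third of the noise-free price, which is the kind of structure a purely linear model would miss.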


In [3]:
# Split into train and test
train = data.iloc[:80].copy()
test = data.iloc[80:].copy()

print(f"Train: {train.shape}, Test: {test.shape}")

Train: (80, 4), Test: (20, 4)


### Create and Fit Random Forest Regression Model

In [4]:
# Create specification
spec = rand_forest(
    trees=500,      # Number of trees
    mtry=2,        # Features to consider at each split
    min_n=5        # Minimum samples to split
).set_mode("regression")

print(spec)

ModelSpec(model_type='rand_forest', engine='sklearn', mode='regression', args={'mtry': 2, 'trees': 500, 'min_n': 5})
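The printed spec names `sklearn` as the engine. As a rough guide, and assuming the arguments map the way the comments in the cell suggest (`trees` to `n_estimators`, `mtry` to `max_features`, `min_n` to `min_samples_split`), a comparable scikit-learn estimator might look like the sketch below. The mapping is an assumption and has not been checked against the py-parsnip source.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Assumed mapping (not confirmed against py-parsnip):
#   trees -> n_estimators, mtry -> max_features, min_n -> min_samples_split
model = RandomForestRegressor(
    n_estimators=500,
    max_features=2,
    min_samples_split=5,
    random_state=42,
)

# Small synthetic problem, only to show that the estimator fits
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(80, 3))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.1, 80)
model.fit(X, y)

print(len(model.estimators_))  # number of fitted trees
```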


In [5]:
# Fit the model
fit = spec.fit(train, "price ~ size_sqft + bedrooms + age_years")
print("Model fitted successfully!")

Model fitted successfully!


### Make Predictions

In [6]:
# Predict on test data
predictions = fit.predict(test, type="numeric")
print(predictions.head(10))

           .pred
0  813175.549853
1  621414.524766
2  373755.287562
3  235502.479796
4  367963.172211
5  368717.258445
6  763234.290133
7  674574.659366
8  827469.867630
9  445940.345067


### Evaluate on Test Data

In [7]:
# evaluate() stores test-set predictions so they appear in the extracted outputs
fit = fit.evaluate(test)
print("Model evaluated on test data!")

Model evaluated on test data!


### Extract Comprehensive Outputs

In [None]:
# Extract the three standard DataFrames
outputs, coefficients, stats = fit.extract_outputs()


OUTPUTS (Observation-level results)
         actuals         fitted       forecast  split        model  \
0  345873.588459  377584.013185  377584.013185  train  rand_forest   
1  966575.171871  912173.847623  912173.847623  train  rand_forest   
2  714517.362122  723801.675977  723801.675977  train  rand_forest   
3  628285.051234  640410.891421  640410.891421  train  rand_forest   
4  334119.075358  333183.793979  333183.793979  train  rand_forest   
5  335266.399745  345420.061154  345420.061154  train  rand_forest   
6  224740.927729  235365.203161  235365.203161  train  rand_forest   
7  821326.170665  815734.457159  815734.457159  train  rand_forest   
8  636415.406888  638234.931631  638234.931631  train  rand_forest   
9  696490.252178  674533.753092  674533.753092  train  rand_forest   

  model_group_name   group      residuals  
0                   global  345873.588459  
1                   global  966575.171871  
2                   global  714517.362122  
3                

In [21]:
outputs

          actuals         fitted       forecast  split        model  model_group_name   group      residuals
 0  345873.588459  377584.013185  377584.013185  train  rand_forest                    global  345873.588459
 1  966575.171871  912173.847623  912173.847623  train  rand_forest                    global  966575.171871
 2  714517.362122  723801.675977  723801.675977  train  rand_forest                    global  714517.362122
 3  628285.051234  640410.891421  640410.891421  train  rand_forest                    global  628285.051234
 4  334119.075358  333183.793979  333183.793979  train  rand_forest                    global  334119.075358
..            ...            ...            ...    ...          ...               ...     ...            ...
95  484851.532297            NaN  476867.653433   test  rand_forest                    global    7983.878864
96  519843.730359            NaN  481707.167622   test  rand_forest                    global   38136.562737
97  436150.452663            NaN  421334.566805   test  rand_forest                    global   14815.885858
98  188252.589867            NaN  247853.225288   test  rand_forest                    global  -59600.635420
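In the rows shown above, the test-split `residuals` equal `actuals - forecast`, while the train-split rows repeat the `actuals` value in the `residuals` column. Whether that reflects library behavior or a display issue is not established here. A minimal sketch, using values copied from the table, of recomputing residuals the same way for both splits:

```python
import pandas as pd

# Rows copied from the outputs table (two train rows, two test rows)
rows = pd.DataFrame({
    "actuals":  [345873.588459, 966575.171871, 484851.532297, 188252.589867],
    "fitted":   [377584.013185, 912173.847623, float("nan"), float("nan")],
    "forecast": [377584.013185, 912173.847623, 476867.653433, 247853.225288],
    "split":    ["train", "train", "test", "test"],
})

# Residual defined as actual minus prediction, applied to both splits
rows["residual_check"] = rows["actuals"] - rows["forecast"]
print(rows[["split", "actuals", "forecast", "residual_check"]])
```

The `forecast` column is used because it is populated for both splits, whereas `fitted` is NaN for test rows.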


In [22]:
coefficients

    variable  coefficient  std_error  t_stat  p_value  ci_0.025  ci_0.975  vif        model  model_group_name   group
0  size_sqft     0.893887        NaN     NaN      NaN       NaN       NaN  NaN  rand_forest                    global
1   bedrooms     0.040431        NaN     NaN      NaN       NaN       NaN  NaN  rand_forest                    global
2  age_years     0.065681        NaN     NaN      NaN       NaN       NaN  NaN  rand_forest                    global


In [23]:
stats

          metric         value  split        model  model_group_name   group
0           rmse  19186.026304  train  rand_forest                    global
1            mae  15351.317094  train  rand_forest                    global
2           mape      4.006206  train  rand_forest                    global
3          smape      3.897623  train  rand_forest                    global
4      r_squared      0.993356  train  rand_forest                    global
5            mda     98.734177  train  rand_forest                    global
6  adj_r_squared      0.993094  train  rand_forest                    global
7           rmse  42600.152198   test  rand_forest                    global
8            mae  35817.167344   test  rand_forest                    global
9           mape      9.180288   test  rand_forest                    global


In [None]:

print("=" * 60)
print("OUTPUTS (Observation-level results)")
print("=" * 60)
print(outputs.head(10))
print(f"\nShape: {outputs.shape}")

In [9]:
print("\n" + "=" * 60)
print("COEFFICIENTS (Feature Importances for Random Forest)")
print("=" * 60)
print(coefficients)
print("\nNote: Random forests use feature importances instead of coefficients.")
print("Higher importance = more influential feature.")


COEFFICIENTS (Feature Importances for Random Forest)
    variable  coefficient  std_error  t_stat  p_value  ci_0.025  ci_0.975  \
0  size_sqft     0.893887        NaN     NaN      NaN       NaN       NaN   
1   bedrooms     0.040431        NaN     NaN      NaN       NaN       NaN   
2  age_years     0.065681        NaN     NaN      NaN       NaN       NaN   

   vif        model model_group_name   group  
0  NaN  rand_forest                   global  
1  NaN  rand_forest                   global  
2  NaN  rand_forest                   global  

Note: Random forests use feature importances instead of coefficients.
Higher importance = more influential feature.
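The importance values in the `coefficient` column can be ranked directly. The sketch below copies the three values from the table and checks that they sum to about 1, which is consistent with normalized importances as scikit-learn reports them. That py-parsnip passes these through unchanged is an assumption.

```python
import pandas as pd

# Importances copied from the coefficients table above
importances = pd.Series(
    {"size_sqft": 0.893887, "bedrooms": 0.040431, "age_years": 0.065681},
    name="importance",
)

# Rank features from most to least influential
ranked = importances.sort_values(ascending=False)
print(ranked)
print(f"Sum of importances: {ranked.sum():.6f}")
```

Because the values are shares of a total, `size_sqft` at about 0.89 accounts for most of the model's split quality in this example.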


In [10]:
print("\n" + "=" * 60)
print("STATS (Model-level metrics by split)")
print("=" * 60)
print(stats[stats['metric'].isin(['rmse', 'mae', 'r_squared', 'mape'])].pivot(
    index='metric', columns='split', values='value'
))
print("\nCompare train vs test metrics to assess overfitting.")


STATS (Model-level metrics by split)
split              test         train
metric                               
mae        35817.167344  15351.317094
mape           9.180288      4.006206
r_squared      0.953568      0.993356
rmse       42600.152198  19186.026304

Compare train vs test metrics to assess overfitting.
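For readers who want to check the reported metrics by hand, here is a minimal sketch of the usual definitions in NumPy. The helper names are hypothetical, and the library's exact formulas are assumed rather than confirmed; MAPE is expressed in percent to match the scale of the table. The adjusted R² formula does reproduce the reported train value of 0.993094 from `r_squared`, 80 training rows, and 3 predictors.

```python
import numpy as np

def regression_metrics(actual, predicted):
    """Common regression metrics; MAPE is returned in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "mape": float(np.mean(np.abs(err / actual)) * 100),
        "r_squared": float(1 - ss_res / ss_tot),
    }

def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R^2 penalizes R^2 for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Train split: R^2 = 0.993356 with 80 rows and 3 predictors
print(round(adjusted_r2(0.993356, 80, 3), 6))

# Ratio of test to train RMSE from the table above
print(round(42600.152198 / 19186.026304, 2))
```

A test RMSE a little over twice the train RMSE, alongside a test R² that stays above 0.95, suggests some overfitting to the training rows without a collapse in held-out accuracy.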


## Example 2: Classification with Random Forest

We'll predict species from sepal length, sepal width, and petal length, using synthetic iris-like data.

In [11]:
# Generate synthetic iris-like data
np.random.seed(42)
n_per_class = 30

# Generate three distinct clusters (species)
setosa = pd.DataFrame({
    'species': 'setosa',
    'sepal_length': np.random.normal(5.0, 0.4, n_per_class),
    'sepal_width': np.random.normal(3.4, 0.3, n_per_class),
    'petal_length': np.random.normal(1.5, 0.2, n_per_class),
})

versicolor = pd.DataFrame({
    'species': 'versicolor',
    'sepal_length': np.random.normal(6.0, 0.5, n_per_class),
    'sepal_width': np.random.normal(2.8, 0.3, n_per_class),
    'petal_length': np.random.normal(4.3, 0.4, n_per_class),
})

virginica = pd.DataFrame({
    'species': 'virginica',
    'sepal_length': np.random.normal(6.5, 0.6, n_per_class),
    'sepal_width': np.random.normal(3.0, 0.3, n_per_class),
    'petal_length': np.random.normal(5.5, 0.5, n_per_class),
})

iris_data = pd.concat([setosa, versicolor, virginica], ignore_index=True)
iris_data = iris_data.sample(frac=1).reset_index(drop=True)  # Shuffle

print(iris_data.head(10))
print("\nClass distribution:")
print(iris_data['species'].value_counts())

      species  sepal_length  sepal_width  petal_length
0      setosa      5.150279     3.307236      1.565750
1      setosa      4.234688     3.309669      1.812929
2   virginica      6.808272     3.055936      4.784929
3  versicolor      6.148060     2.502839      5.046310
4  versicolor      5.982644     2.428915      4.305201
5      setosa      4.906345     3.033747      1.771248
6   virginica      6.809029     2.801464      5.279978
7  versicolor      6.375967     2.889095      5.388068
8   virginica      5.937305     3.190176      4.876108
9      setosa      4.310033     2.956443      0.976051

Class distribution:
species
setosa        30
virginica     30
versicolor    30
Name: count, dtype: int64


In [12]:
# Split into train and test
train_clf = iris_data.iloc[:70].copy()
test_clf = iris_data.iloc[70:].copy()

print(f"Train: {train_clf.shape}, Test: {test_clf.shape}")
print("\nTrain class distribution:")
print(train_clf['species'].value_counts())
print("\nTest class distribution:")
print(test_clf['species'].value_counts())

Train: (70, 4), Test: (20, 4)

Train class distribution:
species
virginica     25
versicolor    23
setosa        22
Name: count, dtype: int64

Test class distribution:
species
setosa        8
versicolor    7
virginica     5
Name: count, dtype: int64
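The positional split above leaves the classes slightly unbalanced in both sets. If equal class shares matter, one option is a stratified split with pandas `groupby(...).sample(...)`. The sketch below uses a toy frame (the name `toy` and its values are made up for illustration) with the same class layout as `iris_data`.

```python
import numpy as np
import pandas as pd

# Toy frame with the same class layout as iris_data: 3 classes x 30 rows
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "species": np.repeat(["setosa", "versicolor", "virginica"], 30),
    "petal_length": rng.normal(4.0, 1.5, 90),
})

# Draw 20% of each class for the test set, keep the rest for training
test_strat = toy.groupby("species").sample(frac=0.2, random_state=42)
train_strat = toy.drop(test_strat.index)

print(train_strat["species"].value_counts())
print(test_strat["species"].value_counts())
```

With 30 rows per class, this yields 6 test rows and 24 train rows for every species.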


### Create and Fit Random Forest Classification Model

In [13]:
# Create specification for classification
spec_clf = rand_forest(
    trees=300,
    mtry=2,
    min_n=2
).set_mode("classification")

print(spec_clf)

ModelSpec(model_type='rand_forest', engine='sklearn', mode='classification', args={'mtry': 2, 'trees': 300, 'min_n': 2})


In [14]:
# Fit the model
fit_clf = spec_clf.fit(train_clf, "species ~ sepal_length + sepal_width + petal_length")
print("Classification model fitted successfully!")

Classification model fitted successfully!


### Make Class Predictions

In [15]:
# Predict class labels
pred_class = fit_clf.predict(test_clf, type="class")
print("Class predictions:")
print(pred_class.head(10))

Class predictions:
  .pred_class
0  versicolor
1  versicolor
2   virginica
3   virginica
4  versicolor
5  versicolor
6  versicolor
7      setosa
8  versicolor
9  versicolor


### Predict Class Probabilities

In [16]:
# Predict class probabilities
pred_prob = fit_clf.predict(test_clf, type="prob")
print("Class probabilities:")
print(pred_prob.head(10))

Class probabilities:
   .pred_setosa  .pred_versicolor  .pred_virginica
0      0.000000          0.713333         0.286667
1      0.253333          0.743333         0.003333
2      0.000000          0.016667         0.983333
3      0.000000          0.383333         0.616667
4      0.036667          0.920000         0.043333
5      0.000000          1.000000         0.000000
6      0.000000          0.996667         0.003333
7      1.000000          0.000000         0.000000
8      0.000000          0.940000         0.060000
9      0.000000          0.946667         0.053333
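The class predictions and the probabilities are related: for the rows shown, the predicted label is the class with the highest probability. A small sketch, using the first five probability rows copied from the output above, recovers the labels with `idxmax`:

```python
import pandas as pd

# First five probability rows copied from the output above
probs = pd.DataFrame({
    ".pred_setosa":     [0.000000, 0.253333, 0.000000, 0.000000, 0.036667],
    ".pred_versicolor": [0.713333, 0.743333, 0.016667, 0.383333, 0.920000],
    ".pred_virginica":  [0.286667, 0.003333, 0.983333, 0.616667, 0.043333],
})

# Each row should sum to about 1 (up to display rounding)
print(probs.sum(axis=1).round(4).tolist())

# The most probable class per row, with the ".pred_" prefix removed
labels = probs.idxmax(axis=1).str.replace(".pred_", "", regex=False)
print(labels.tolist())
```

The recovered labels match the first five entries of the `.pred_class` output. Row 3 is worth a look: at 0.62 versus 0.38 the forest is far less certain than the bare label suggests.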


### Evaluate and Extract Comprehensive Outputs

In [17]:
# Evaluate on test data
fit_clf = fit_clf.evaluate(test_clf)

# Extract outputs
outputs_clf, coefficients_clf, stats_clf = fit_clf.extract_outputs()

print("=" * 60)
print("OUTPUTS (Classification)")
print("=" * 60)
print(outputs_clf.head(15))
print("\nNote: 'fitted' and 'forecast' show predicted class labels")

OUTPUTS (Classification)
       actuals      fitted    forecast  split        model model_group_name  \
0       setosa      setosa      setosa  train  rand_forest                    
1       setosa      setosa      setosa  train  rand_forest                    
2    virginica   virginica   virginica  train  rand_forest                    
3   versicolor  versicolor  versicolor  train  rand_forest                    
4   versicolor  versicolor  versicolor  train  rand_forest                    
5       setosa      setosa      setosa  train  rand_forest                    
6    virginica   virginica   virginica  train  rand_forest                    
7   versicolor  versicolor  versicolor  train  rand_forest                    
8    virginica   virginica   virginica  train  rand_forest                    
9       setosa      setosa      setosa  train  rand_forest                    
10  versicolor  versicolor  versicolor  train  rand_forest                    
11      setosa      setosa 

In [18]:
print("\n" + "=" * 60)
print("FEATURE IMPORTANCES (Classification)")
print("=" * 60)
print(coefficients_clf)
print("\nWhich feature is most important for classifying species?")


FEATURE IMPORTANCES (Classification)
       variable  coefficient  std_error  t_stat  p_value  ci_0.025  ci_0.975  \
0  sepal_length     0.256887        NaN     NaN      NaN       NaN       NaN   
1   sepal_width     0.063072        NaN     NaN      NaN       NaN       NaN   
2  petal_length     0.680041        NaN     NaN      NaN       NaN       NaN   

   vif        model model_group_name   group  
0  NaN  rand_forest                   global  
1  NaN  rand_forest                   global  
2  NaN  rand_forest                   global  

Which feature is most important for classifying species?


### Calculate Accuracy

In [19]:
# Calculate accuracy from outputs
test_outputs = outputs_clf[outputs_clf['split'] == 'test'].copy()
accuracy = (test_outputs['actuals'] == test_outputs['forecast']).mean()
print(f"Test Accuracy: {accuracy:.2%}")

# Calculate train accuracy
train_outputs = outputs_clf[outputs_clf['split'] == 'train'].copy()
train_accuracy = (train_outputs['actuals'] == train_outputs['forecast']).mean()
print(f"Train Accuracy: {train_accuracy:.2%}")

print(f"\nOverfitting check: Train - Test = {(train_accuracy - accuracy):.2%}")

Test Accuracy: 95.00%
Train Accuracy: 100.00%

Overfitting check: Train - Test = 5.00%
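Accuracy alone does not show which classes get confused. A confusion matrix built with `pd.crosstab` can fill that gap. The labels below are hypothetical toy values, not the notebook's results; in the notebook the same call could take `test_outputs['actuals']` and `test_outputs['forecast']`.

```python
import numpy as np
import pandas as pd

# Hypothetical labels for illustration only (not the notebook's results)
actual = pd.Series(
    ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    name="actual",
)
predicted = pd.Series(
    ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"],
    name="predicted",
)

# Rows are true classes, columns are predicted classes
confusion = pd.crosstab(actual, predicted)
print(confusion)

# Accuracy is the diagonal (correct predictions) over the total count
accuracy_from_cm = np.diag(confusion.values).sum() / confusion.values.sum()
print(f"Accuracy: {accuracy_from_cm:.2%}")
```

The diagonal shortcut assumes every class appears in both series, so that the table is square; if a class is never predicted, reindex the columns first.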


## Example 3: Comparing Different Tree Counts

In [20]:
# Compare models with different number of trees
results = []

for n_trees in [50, 100, 300, 500]:
    spec_temp = rand_forest(trees=n_trees).set_mode("regression")
    fit_temp = spec_temp.fit(train, "price ~ size_sqft + bedrooms + age_years")
    fit_temp = fit_temp.evaluate(test)
    
    _, _, stats_temp = fit_temp.extract_outputs()
    
    # Get test RMSE
    test_rmse = stats_temp[
        (stats_temp['metric'] == 'rmse') & 
        (stats_temp['split'] == 'test')
    ]['value'].values[0]
    
    results.append({
        'n_trees': n_trees,
        'test_rmse': test_rmse
    })

results_df = pd.DataFrame(results)
print("\nEffect of number of trees on Test RMSE:")
print(results_df)
print("\nNote: More trees generally improve performance, but with diminishing returns.")


Effect of number of trees on Test RMSE:
   n_trees     test_rmse
0       50  33659.733784
1      100  32548.619338
2      300  32707.751684
3      500  32463.661779

Note: More trees generally improve performance, but with diminishing returns.
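Differences of a few hundred in RMSE between tree counts can be within run-to-run noise on 20 test rows. One way to compare tree counts without touching the test set is the out-of-bag (OOB) score, which scikit-learn exposes directly. The sketch below calls scikit-learn rather than py-parsnip, on synthetic data that only loosely mirrors the housing example; whether py-parsnip exposes OOB scores is not covered here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data loosely mirroring the housing example (for illustration)
rng = np.random.default_rng(42)
n = 80
size = rng.uniform(800, 3000, n)
beds = rng.integers(1, 6, n)
age = rng.uniform(0, 50, n)
X = np.column_stack([size, beds, age])
y = 150 * size + 20000 * beds - 500 * age + 0.05 * size**2 + rng.normal(0, 20000, n)

oob_results = []
for n_trees in [50, 100, 300, 500]:
    model = RandomForestRegressor(
        n_estimators=n_trees,
        oob_score=True,   # score each row using only trees that did not train on it
        random_state=42,
    )
    model.fit(X, y)
    # For a regressor, oob_score_ is an R^2 value
    oob_results.append({"n_trees": n_trees, "oob_r2": model.oob_score_})

for row in oob_results:
    print(row)
```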


## Summary

This demo showed:

1. **Regression with Random Forest**:
   - Create spec with `rand_forest().set_mode("regression")`
   - Customize `trees`, `mtry`, `min_n` parameters
   - Extract feature importances instead of coefficients

2. **Classification with Random Forest**:
   - Set mode to `"classification"`
   - Predict class labels (`type="class"`) or probabilities (`type="prob"`)
   - Handle multi-class problems naturally

3. **Comprehensive Outputs**:
   - `outputs`: Observation-level predictions and residuals
   - `coefficients`: Feature importances, reported in the `coefficient` column (inferential columns such as `std_error` and `p_value` are NaN)
   - `stats`: Model metrics by split (train/test)

4. **Model Evaluation**:
   - Use `evaluate()` to store test predictions
   - Compare train vs test metrics to detect overfitting
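   - Tune hyperparameters like `trees` on held-out data; this demo uses the test split for simplicity, but a separate validation set or cross-validation avoids an optimistic final estimate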