# Model Evaluation: Random Forest and XGBoost

After engineering new features and implementing walk-forward backtesting, we evaluate two ensemble learning models: **Random Forest** and **XGBoost**.

These algorithms were chosen for their ability to handle non-linear relationships, their robustness to noise, and their effectiveness in capturing complex interactions between features. Our background research also found that financial prediction models often use Random Forest to classify market direction (up or down).

The objective of this evaluation is to identify which model provides more consistent and reliable predictions for price movement classification tasks.

This section will detail the model configurations, training methodology, evaluation metrics, and performance comparison across multiple time horizons.


In [13]:
## loading data
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, accuracy_score, recall_score
import time
import matplotlib.pyplot as plt

df = pd.read_csv("../data/2019-2023_stock_with_features_dif_tickers.csv")
df['date'] = pd.to_datetime(df['date'])

# Pre-selected features for modeling
features = ['quantity', 'volume', 'ibovespa_close', 'day_of_week', 'price_range', 'volume_per_quantity',
            'rolling_std_5', 'rolling_return_5', 'momentum_5', 'rolling_volume_5', 'Trend_2',
            'Close_Ratio_5', 'Trend_5', 'Close_Ratio_55', 'Trend_55', 'Close_Ratio_220']
df

Unnamed: 0,date,ticker,quantity,volume,ibovespa_close,day_of_week,price_range,volume_per_quantity,tomorrow,target,rolling_std_5,rolling_return_5,rolling_volume_5,momentum_5,Trend_2,Close_Ratio_5,Trend_5,Close_Ratio_55,Trend_55,Close_Ratio_220
0,2019-11-18,ABEV3,9445200,165711710.0,106347.0,1,0.26,17.544542,17.67,1,0.091815,0.001866,246297201.6,0.008848,2.0,1.005843,3.0,0.946900,27.0,0.971916
1,2019-11-18,B3SA3,10101700,501185409.0,106347.0,1,1.54,49.613967,47.94,0,0.542015,0.007892,394530554.4,-0.008612,1.0,0.990027,3.0,1.066652,34.0,1.284061
2,2019-11-18,BBAS3,15585600,724962222.0,106347.0,1,1.15,46.514874,45.79,0,0.494803,-0.001119,641216796.2,-0.014652,1.0,0.989588,2.0,0.987510,26.0,0.933396
3,2019-11-18,BBDC4,18808900,625495829.0,106347.0,1,0.91,33.255312,33.07,0,0.394373,-0.005110,597208475.0,-0.020286,0.0,0.986188,0.0,0.970354,27.0,0.886461
4,2019-11-18,BRAP4,1248000,40441499.0,106347.0,1,0.78,32.405047,32.69,1,0.455379,-0.001457,46990409.6,-0.002218,2.0,1.002414,3.0,1.032162,32.0,1.047033
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28982,2023-11-17,ITUB4,24575400,748326813.0,125062.0,5,0.47,30.450239,,0,0.419786,0.005742,882559210.0,0.026220,2.0,1.017918,4.0,1.101164,28.0,1.139022
28983,2023-11-17,MGLU3,297371000,659930643.0,125062.0,5,0.21,2.219217,,0,0.199675,0.061008,503911766.0,0.207835,2.0,1.144330,4.0,1.089887,19.0,0.709667
28984,2023-11-17,GGBR4,12991400,319205611.0,125062.0,5,0.45,24.570532,,0,0.610066,0.011093,329996779.2,0.034788,2.0,1.019381,4.0,1.034852,26.0,0.936003
28985,2023-11-17,WEGE3,8938000,296624415.0,125062.0,5,0.82,33.186889,,0,0.470606,0.004170,266163011.2,0.006415,1.0,1.001264,3.0,0.959653,22.0,0.883089


In [14]:
# defining the Random Forest model
model = RandomForestClassifier(n_estimators=200, min_samples_split=50, n_jobs=-1, random_state=1)

In [1]:
# helpers for predictions
def evaluate_preds(y_true, y_pred, label="Evaluation"):
    """
    Compute and print Accuracy, Precision, Recall metrics.
    """
    acc = accuracy_score(y_true, y_pred)
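    # zero_division=0 returns 0 without a warning when a slice has no positive predictions or labels
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec = recall_score(y_true, y_pred, zero_division=0)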
    print(f"\n=== {label} Metrics ===")
    print("Accuracy:", round(acc, 4))
    print("Precision:", round(prec, 4))
    print("Recall:", round(rec, 4))
    return acc, prec, rec

def predict(train, test, predictors, model, threshold=0.6):
    """
    Train model on train data, predict probabilities on test data,
    then convert probabilities to binary predictions with given threshold.
    """
    model.fit(train[predictors], train['target'])
    probs = model.predict_proba(test[predictors])[:, 1]
    preds = (probs >= threshold).astype(int)
    return pd.DataFrame({'target': test['target'], 'Predictions': preds}, index=test.index)

def backtest_stepwise(df, model, features, start=2000, step=1500, threshold=0.6):
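    """
    Stepwise (walk-forward) backtesting with an expanding window:
    train on the first `idx` rows, test on the next `step` rows.
    Assumes `df` is sorted chronologically. Slices are taken by row count,
    so one trading day can be split between the train and test sets.
    Prints train/test metrics and the overfit ratio at each step.
    """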
    all_preds = []
    overfit_ratios = []
    total_steps = len(range(start, df.shape[0], step))  # exact number of loop iterations
    start_time = time.time()

    for i, idx in enumerate(range(start, df.shape[0], step), 1):
        train = df.iloc[:idx].copy()
        test = df.iloc[idx:idx + step].copy()

        # Make predictions on test data
        preds = predict(train, test, features, model, threshold)
        all_preds.append(preds)

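        # Evaluate train and test performance.
        # Note: model.predict applies the default decision rule (effectively a 0.5 cutoff
        # for two classes), while the test predictions use `threshold`, so the two
        # accuracies are not strictly like-for-like.
        train_preds = model.predict(train[features])
        train_acc, _, _ = evaluate_preds(train['target'], train_preds, label=f"Train Step {i}")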

        test_acc, _, _ = evaluate_preds(test['target'], preds['Predictions'], label=f"Test Step {i}")

        # Calculate and print overfit ratio (train_acc / test_acc)
        if test_acc > 0:
            overfit_ratio = train_acc / test_acc
            overfit_ratios.append(overfit_ratio)
            print(f"Overfit Ratio (Train/Test): {overfit_ratio:.3f}")
        else:
            print("Test accuracy is 0, skipping overfit ratio.")
            overfit_ratios.append(float('nan'))

        # Show elapsed time and estimated time remaining
        elapsed = time.time() - start_time
        avg_per_step = elapsed / i
        remaining_steps = total_steps - i
        eta = remaining_steps * avg_per_step
        print(f"Step {i}/{total_steps} done. Elapsed: {elapsed:.1f}s, ETA: {eta / 60:.1f} min\n")

    # Final average overfit ratio across all steps
    valid_ratios = [r for r in overfit_ratios if not pd.isna(r)]
    if valid_ratios:
        avg_ratio = sum(valid_ratios) / len(valid_ratios)
        print(f"\nFinal Average Overfit Ratio across steps: {avg_ratio:.3f}")
    else:
        print("\nNo valid overfit ratios calculated.")

    return pd.concat(all_preds)
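
The slicing logic of `backtest_stepwise` can be illustrated on its own. The sketch below lists the expanding train/test windows that `range(start, n_rows, step)` produces. The helper name `walk_forward_windows` and the row count of 10,000 are illustrative assumptions, not part of the notebook.

```python
def walk_forward_windows(n_rows, start=2000, step=1500):
    """Return (train_end, test_end) row positions for each expanding-window step.

    Training uses rows [0, train_end); testing uses rows [train_end, test_end).
    """
    windows = []
    for train_end in range(start, n_rows, step):
        test_end = min(train_end + step, n_rows)  # the last test slice may be shorter
        windows.append((train_end, test_end))
    return windows


# Hypothetical row count, chosen only to keep the example small
windows = walk_forward_windows(10_000)
for train_end, test_end in windows:
    print(f"train rows [0, {train_end})  ->  test rows [{train_end}, {test_end})")
```

Each step retrains on all earlier rows, so later steps see more history than earlier ones, and the final test slice covers only the rows that remain.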


In [2]:
## running the model
predictions = backtest_stepwise(df, model, features)

NameError: name 'df' is not defined

In [17]:
# evaluate the model and analyse the predictions
evaluate_preds(predictions['target'], predictions['Predictions'], label="Overall Backtest")

print("\n=== Prediction Distribution ===")
print(predictions['Predictions'].value_counts())

print("\n=== Actual Distribution ===")
print(predictions['target'].value_counts(normalize=True))



=== Overall Backtest Metrics ===
Accuracy: 0.569
Precision: 0.6441
Recall: 0.2733

=== Prediction Distribution ===
Predictions
0    21365
1     5622
Name: count, dtype: int64

=== Actual Distribution ===
target
0    0.508986
1    0.491014
Name: proportion, dtype: float64


# Backtest Performance Analysis

## Overall Backtest Metrics
- **Accuracy:** 0.569
- **Precision:** 0.6441
- **Recall:** 0.2733

Overall accuracy is modest at about 57%, compared with roughly 51% for a baseline that always predicts the majority class (0). Precision is reasonably good at 64%: when the model predicts an upward movement (class 1), it is correct roughly two times out of three. Recall, however, is low at 27%, so the model identifies only about a quarter of the actual upward moves.
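
As a consistency check, the approximate confusion matrix can be reconstructed from the figures printed above. This is a sketch that relies on rounding the reported values, so the counts are estimates rather than output taken from the notebook.

```python
# Values reported in the backtest output
n_pred_pos = 5622            # rows predicted as class 1
n_pred_neg = 21365           # rows predicted as class 0
precision = 0.6441
actual_pos_share = 0.491014  # share of rows whose true label is 1

n_total = n_pred_pos + n_pred_neg

tp = round(precision * n_pred_pos)              # predicted up, actually up
fp = n_pred_pos - tp                            # predicted up, actually down
actual_pos = round(actual_pos_share * n_total)  # all true upward moves
fn = actual_pos - tp                            # upward moves the model missed
tn = n_total - tp - fp - fn                     # predicted down, actually down

recall = tp / (tp + fn)
accuracy = (tp + tn) / n_total

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Recall:", round(recall, 4))
print("Accuracy:", round(accuracy, 4))
```

The recomputed recall and accuracy agree with the reported 0.2733 and 0.569, which suggests the three metrics and the two distributions describe the same set of predictions.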

---

## Prediction vs Actual Distribution
| Class | Predicted (count) | Predicted (share) | Actual (share) |
|-------|-------------------|-------------------|----------------|
| 0     | 21,365            | 79.2%             | 50.9%          |
| 1     | 5,622             | 20.8%             | 49.1%          |

The model predicts the negative class (0) far more often than the positive class (1): only about 21% of predictions are positive (5,622 of 26,987), even though the actual classes are almost evenly balanced (about 51% negative, 49% positive). Part of this skew comes from the 0.6 probability threshold used in `predict`, which requires stronger evidence before labeling a day as an upward move. The skew also explains the low recall, since the model misses a large portion of the positive cases.

---

## Overfitting and Model Selection

During the stepwise backtest, the **average overfitting ratio** (train accuracy divided by test accuracy) was approximately **1.57**, meaning training accuracy was on average about 57% higher than test accuracy. This gap indicates a noticeable degree of overfitting: the Random Forest model learns patterns that do not generalize well. The ratio should be read with some caution, because training accuracy is computed with `model.predict` (an effective 0.5 cutoff), whereas test accuracy uses the 0.6 threshold.
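
One way to make the ratio like-for-like is to apply the same probability threshold to both the training and test sets. The sketch below uses small made-up probability lists; in the notebook they would come from `model.predict_proba(...)[:, 1]`. The function names are hypothetical.

```python
def accuracy_at_threshold(y_true, probs, threshold):
    """Accuracy after converting probabilities to 0/1 labels with a given cutoff."""
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(int(p == y) for p, y in zip(preds, y_true))
    return correct / len(y_true)


def overfit_ratio(train_y, train_probs, test_y, test_probs, threshold=0.6):
    """Train accuracy divided by test accuracy, both measured at the same cutoff."""
    train_acc = accuracy_at_threshold(train_y, train_probs, threshold)
    test_acc = accuracy_at_threshold(test_y, test_probs, threshold)
    return float("nan") if test_acc == 0 else train_acc / test_acc


# Toy data (illustrative only, not taken from the backtest)
train_y = [1, 1, 0, 0, 1, 0]
train_probs = [0.9, 0.7, 0.2, 0.1, 0.65, 0.3]
test_y = [1, 0, 1, 0]
test_probs = [0.62, 0.55, 0.4, 0.7]

ratio = overfit_ratio(train_y, train_probs, test_y, test_probs, threshold=0.6)
print(f"Overfit ratio at threshold 0.6: {ratio:.2f}")
```

A ratio close to 1 would indicate similar performance on seen and unseen data; values well above 1 point to overfitting.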

The relatively high overfit ratio, coupled with low recall and a skewed prediction distribution, suggests that the current Random Forest model struggles to generalize consistently across steps.

---

## Next Steps

To address overfitting and improve recall and overall predictive power, the next phase involves switching to **XGBoost**, which exposes finer-grained hyperparameters (tree depth, learning rate, subsampling, regularization) to control model complexity and may help reduce the skew toward negative predictions.
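
A possible starting configuration is sketched below. The parameter values are untuned assumptions chosen to favor shallow, regularized trees, and `make_xgb_model` is a hypothetical helper; it requires the `xgboost` package to be installed.

```python
# Illustrative, untuned starting values (assumptions, not results from this notebook)
xgb_params = {
    "n_estimators": 300,       # number of boosting rounds
    "max_depth": 3,            # shallow trees limit model complexity
    "learning_rate": 0.05,     # smaller steps, usually paired with more rounds
    "subsample": 0.8,          # fraction of rows sampled per tree
    "colsample_bytree": 0.8,   # fraction of features sampled per tree
    "min_child_weight": 10,    # larger values discourage splits on few samples
    "reg_lambda": 1.0,         # L2 regularization on leaf weights
    "n_jobs": -1,
    "random_state": 1,
}


def make_xgb_model(params=None):
    """Build an XGBClassifier from a parameter dict (imports xgboost lazily)."""
    from xgboost import XGBClassifier  # third-party dependency

    return XGBClassifier(**(params or xgb_params))
```

Because `XGBClassifier` follows the scikit-learn estimator interface (`fit`, `predict`, `predict_proba`), the resulting model should be usable as the `model` argument of `backtest_stepwise` without changing the helpers.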


For XGBoost, we will try different probability thresholds to balance precision against recall, favoring higher precision to avoid false trades.
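
The trade-off can be explored with a simple threshold sweep. The sketch below uses made-up labels and probabilities to show how precision and recall change as the cutoff rises; the function name and data are illustrative assumptions.

```python
def precision_recall_at(y_true, probs, threshold):
    """Precision and recall for class 1 at a given probability cutoff."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, y_true) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (illustrative only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.85, 0.65, 0.7, 0.55, 0.45, 0.3, 0.52, 0.58]

sweep = {t: precision_recall_at(y_true, probs, t) for t in (0.5, 0.6, 0.7)}
for threshold, (precision, recall) in sweep.items():
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy example, raising the cutoff from 0.5 to 0.7 lifts precision while recall drops, which is the same pattern seen with the 0.6 threshold in the Random Forest backtest.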